Windows event notification #52

Open
njsmith opened this Issue Feb 12, 2017 · 13 comments

@njsmith
Member

njsmith commented Feb 12, 2017

Problem

Windows has 3 incompatible families of event notification APIs: IOCP, select/WSAPoll, and WaitForMultipleObjects-and-variants. They each have unique capabilities. This means: if you want to be able to react to all the different possible events that Windows can signal, then you must use all 3 of these. Needless to say, this creates a challenge for event loop design. There are a number of potentially viable ways to arrange these pieces; the question is which one we should use.

(Actually, all 3 together still isn't sufficient, b/c there are some things that still require threads – like console IO. But never mind. Just remember that when someone tells you that Windows' I/O subsystem is great, that their statement isn't wrong but does require taking a certain narrow perspective...)

Considerations

The WaitFor*Event family

The Event-related APIs are necessary to, for example, wait for a notification that a child process has exited. (The job object API provides a way to request IOCP notifications about process death, but the docs warn that the notifications are lossy and therefore useless...) Otherwise though they're very limited – in particular they have both O(n) behavior and max 64 objects in an interest set – so you definitely don't want to use these as your primary blocking call. We're going to be calling these in a background thread of some kind. The two natural architectures are to use WaitForSingleObject(Ex) and allocate one-thread-per-event, or else use WaitForMultipleObjects(Ex) and try and coalesce up to 64 events into each thread (substantially more complicated to implement but with 64x less memory overhead for thread stacks, if it matters). This is orthogonal to the rest of this issue, so it gets its own thread: #233
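To make the coalescing option concrete, here is a minimal sketch of just the bookkeeping: splitting a set of handles into groups that each fit in a single WaitForMultipleObjects call. The helper name `batch_handles` is made up for illustration, and reserving one slot per thread for a wakeup/cancel event is an assumption about how such a thread would be interrupted, not something this issue has decided.

```python
from typing import List, Sequence

# Documented Windows limit on handles per WaitForMultipleObjects call.
MAXIMUM_WAIT_OBJECTS = 64


def batch_handles(handles: Sequence[int], reserve_wakeup: bool = True) -> List[List[int]]:
    """Split handles into groups that each fit in one wait call.

    If reserve_wakeup is true, one slot per group is kept free for a
    (hypothetical) wakeup event used to interrupt the waiting thread.
    """
    per_thread = MAXIMUM_WAIT_OBJECTS - (1 if reserve_wakeup else 0)
    return [
        list(handles[i:i + per_thread])
        for i in range(0, len(handles), per_thread)
    ]


# 200 fake handle values, grouped for background waiter threads.
batches = batch_handles(list(range(200)))
print(len(batches), [len(b) for b in batches])
```

The hard part of the coalescing design is not this arithmetic but adding and removing handles from a thread that is already blocked, which is why one-thread-per-event is so much simpler.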

IOCP

IOCP is the crown jewel of the Windows I/O subsystem, and what you generally hear recommended. It follows a natively asynchronous model where you just go ahead and issue a read or write or whatever, and it runs in the background until eventually the kernel tells you it's done. It provides an O(1) notification mechanism. It's pretty slick. But... it's not as obvious a choice as everyone makes it sound. (Did you know the Chrome team has mostly given up on trying to make it work?)

Issues:

  • When doing a UDP send, the send is only notified as complete once the packet hits the wire; i.e., using IOCP for UDP totally removes in-kernel buffering/flow-control. So to get decent throughput you must implement your own buffering system allowing multiple UDP sends to be in flight at once (but not too many because you don't want to introduce arbitrary latency). Or you could just use the non-blocking API and the kernel worries about this for you. (This hit Chrome hard; they switched to using non-blocking IO for UDP on Windows. ref1, ref2.)

  • When doing a TCP receive with a large buffer, apparently the kernel does a Nagle-like thing where it tries to hang onto the data for a while before delivering it to the application, thus introducing pointless latency. (This also bit Chrome hard; they switched to using non-blocking IO for TCP receive on Windows. ref1, ref2)

  • Sometimes you really do want to check whether a socket is readable before issuing a read: in particular, apparently outstanding IOCP receive buffers get pinned into kernel memory or some such nonsense, so it's possible to exhaust system resources by trying to listen to a large number of mostly-idle sockets.

  • Sometimes you really do want to check whether a socket is writable before issuing a write: in particular, because it allows adaptive protocols to provide lower latency if they can delay deciding what bytes to write until the last moment.

  • Python provides a complete non-blocking API out-of-the-box, and we use this API on other platforms, so using non-blocking IO on Windows as well is much MUCH simpler for us to implement than IOCP, which requires us to pretty much build our own wrappers from scratch.
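For reference, this is roughly what the out-of-the-box non-blocking API looks like in plain Python. It is only a sketch: `try_recv` is a hypothetical helper, and `socket.socketpair()` stands in for a real network connection.

```python
import select
import socket


def try_recv(sock: socket.socket, max_bytes: int = 4096):
    """Return received bytes, or None if the read would block."""
    try:
        return sock.recv(max_bytes)
    except BlockingIOError:
        return None


a, b = socket.socketpair()
a.setblocking(False)
b.setblocking(False)

# Nothing has been sent yet, so a non-blocking recv reports "would block".
first = try_recv(a)

b.send(b"ping")
# Wait (with a timeout) until the kernel says the data has arrived.
select.select([a], [], [], 1.0)
second = try_recv(a)

print(first, second)
a.close()
b.close()
```

The same code runs unchanged on every platform, which is the point being made above: the kernel does the buffering, and Python already wraps the calls.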

On the other hand, IOCP is the only way to do a number of things like: non-blocking IO to the filesystem, or monitoring the filesystem for changes, or non-blocking IO on named pipes. (Named pipes are popular for talking to subprocesses – though it's also possible to use a socket if you set it up right.)

select/WSAPoll

You can also use select/WSAPoll. This is the only documented way to check if a socket is readable/writable. However:

  • As is well known, these are O(n) APIs, which sucks if you have lots of sockets. It's not clear how much it sucks exactly -- just copying the buffer into kernel-space probably isn't a big deal for realistic interest set sizes -- but clearly it's not as nice as O(1). On my laptop, select.select on 3 sets of 512 idle sockets takes <200 microseconds, so I don't think this will, like, immediately kill us. Especially since people mostly don't run big servers on Windows? OTOH an empty epoll on the same laptop returns in ~0.6 microseconds, so there is some difference...

  • select.select is limited to 512 sockets, but this is trivially overcome; the Windows fd_set structure is just an array of SOCKETs + a length field, which you can allocate in any size you like (#3). (This is a nice side-effect of Windows never having had a dense fd space. This also means WSAPoll doesn't have much reason to exist. Unlike other platforms where poll beats select because poll uses an array and select uses a bitmap, WSAPoll is not really any more efficient than select. Its only advantage is that it's similar to how poll works on other platforms... but it's gratuitously incompatible. The one other interesting feature is that you can do an alertable wait with it, which gives a way to cancel it from another thread without using an explicit wakeup socket, via QueueUserAPC.)

  • Non-blocking IO on windows is apparently a bit inefficient because it adds an extra copy. (I guess they don't have zero-copy enqueueing of data to receive buffers? And on send I guess it makes sense that you can do that legitimately zero-copy with IOCP but not with nonblocking, which is nice.) Again I'm not sure how much this matters given that we don't have zero-copy byte buffers in Python to start with, but it's a thing.

  • select only works for sockets; you still need IOCP etc. for responding to other kinds of notifications.
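To illustrate the "array of SOCKETs + a length field" point from the list above, here is a sketch of how an arbitrarily sized fd_set could be laid out with ctypes. The helper names are hypothetical, and using `c_size_t` for SOCKET (a pointer-sized unsigned integer) is an assumption that holds on the usual 32-bit and 64-bit targets. The sketch only builds the structure; it does not call into ws2_32.dll.

```python
import ctypes


def make_fd_set_type(capacity: int):
    """Build a ctypes struct shaped like Winsock's fd_set with a custom capacity."""

    class FdSet(ctypes.Structure):
        _fields_ = [
            ("fd_count", ctypes.c_uint),               # u_int fd_count
            ("fd_array", ctypes.c_size_t * capacity),  # SOCKET fd_array[capacity]
        ]

    return FdSet


def fill_fd_set(handles):
    """Allocate an fd_set exactly large enough for the given socket handles."""
    fd_set = make_fd_set_type(max(len(handles), 1))()
    fd_set.fd_count = len(handles)
    for i, handle in enumerate(handles):
        fd_set.fd_array[i] = handle
    return fd_set


# Far more than the 512 entries select.select allows.
big = fill_fd_set(list(range(1000, 3000)))
print(big.fd_count, ctypes.sizeof(big))
```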

Options

Given all of the above, our current design is a hybrid that uses select and non-blocking IO for sockets, with IOCP available when needed. We run select in the main thread, and IOCP in a worker thread, with a wakeup socket to notify when IOCP events occur. This is vastly simpler than doing it the other way around, because you can trivially queue work to an IOCP from any thread, while if you want to modify select's interest set from another thread it's a mess. As an initial design, this makes a lot of sense, because it allows us to provide full features (including e.g. wait_writable for adaptive protocols), avoid the tricky issues that IOCP creates for sockets, and requires a minimum of special code.
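The shape of that hybrid can be sketched without touching any Windows API. In the sketch below the worker thread is only a stand-in for a thread blocked in GetQueuedCompletionStatus, and all names (`completions`, `iocp_worker`, `run_main_loop`) are made up for illustration; the part that matters is the wakeup socket that lets the worker interrupt the main thread's select.

```python
import queue
import select
import socket
import threading

completions = queue.Queue()
wakeup_recv, wakeup_send = socket.socketpair()
wakeup_recv.setblocking(False)


def iocp_worker(events):
    # Stand-in for a thread pulling completions off an IOCP.
    for event in events:
        completions.put(event)      # publish the completion first...
        wakeup_send.send(b"\x00")   # ...then poke the main thread's select


def run_main_loop(expected):
    seen = []
    while len(seen) < expected:
        # A real loop would also pass the user's sockets to select here.
        readable, _, _ = select.select([wakeup_recv], [], [], 5.0)
        if not readable:
            break  # timed out; give up rather than hang
        try:
            wakeup_recv.recv(4096)  # drain pending wakeup bytes
        except BlockingIOError:
            pass
        while True:
            try:
                seen.append(completions.get_nowait())
            except queue.Empty:
                break
    return seen


worker = threading.Thread(target=iocp_worker, args=(["read-done", "write-done"],))
worker.start()
result = run_main_loop(expected=2)
worker.join()
print(result)
```

Because the completion is queued before the wakeup byte is sent, the main thread can never drain a wakeup and then find the queue missing the event that triggered it.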

The other attractive option would be if we could solve the issues with IOCP and switch to using it alone – this would be simpler and get rid of the O(n) select. However, as we can see above, there are a whole list of challenges that would need to be overcome first.

Working around IOCP's limitations

UDP sends

I'm not really sure what the best approach here is. One option is just to limit the amount of outstanding UDP data to some fixed amount (maybe tunable through a "virtual" (i.e. implemented by us) sockopt), and drop packets or return errors if we exceed that. This is clearly solvable in principle, it's just a bit annoying to figure out the details.
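A minimal sketch of the fixed-limit idea, with all names hypothetical: the caller reserves budget before issuing an overlapped send and releases it when the completion arrives. Whether a failed reservation should drop the packet, raise an error, or wait is exactly the detail left open above, so the sketch only reports it.

```python
class UdpSendBudget:
    """Tracks how many bytes of UDP sends are in flight against a fixed cap."""

    def __init__(self, max_outstanding_bytes: int):
        self.max_outstanding_bytes = max_outstanding_bytes
        self.outstanding_bytes = 0

    def try_reserve(self, size: int) -> bool:
        """Reserve room for a send; return False if it would exceed the cap."""
        if self.outstanding_bytes + size > self.max_outstanding_bytes:
            return False
        self.outstanding_bytes += size
        return True

    def release(self, size: int) -> None:
        """Call when the send's completion notification arrives."""
        self.outstanding_bytes -= size


budget = UdpSendBudget(max_outstanding_bytes=3000)
accepted = [budget.try_reserve(1200) for _ in range(3)]
print(accepted, budget.outstanding_bytes)
budget.release(1200)  # one send completed, so there is room again
print(budget.try_reserve(1200))
```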

Spurious extra latency in TCP receives

I think that using the MSG_PUSH_IMMEDIATE flag should solve this.

Checking readability / writability

It turns out that IOCP actually can check readability! It's not mentioned on MSDN at all, but there's a well-known bit of folklore about the "zero-byte read". If you issue a zero-byte read, it won't complete until there's data ready to read. ref1 (← official MS docs!), ref2, ref3.

That's for SOCK_STREAM sockets. What about SOCK_DGRAM? libuv does zero-byte reads with MSG_PEEK set (to avoid consuming the packet, truncating it to zero bytes in the process). MSDN explicitly says that this doesn't work (MSG_PEEK and overlapped IO supposedly don't work together), but I guess I trust libuv more than MSDN? I don't 100% trust either – this would need to be verified.

What about writability? Empirically, if you have a non-blocking socket on windows with a full send buffer and you do a zero-byte send, it returns EWOULDBLOCK. (This is weird; other platforms don't do this.) If this behavior also translates to IOCP sends, then this zero-byte send trick would give us a way to use IOCP to check writability on SOCK_STREAM sockets.

For writability of SOCK_DGRAM I don't think there's any trick, but it's not clear how meaningful SOCK_DGRAM writability is anyway. If we do our own buffering then presumably we can implement it there.

Alternatively, there is a remarkable piece of undocumented sorcery, where you reach down directly to make syscalls, bypassing the Winsock userland, and apparently can get OVERLAPPED notifications when a socket is readable/writable: ref1, ref2, ref3, ref4, ref5. I guess this is how select is implemented? The problem with this is that it only works if your sockets are implemented directly in the kernel, which is apparently not always the case (because of like... antivirus tools and other horrible things that can interpose themselves into your socket API). So I'm inclined to discount this as unreliable. [Edit: or maybe not, see below]

Implementing all this junk

I actually got a ways into this. Then I ripped it out when I realized how many nasty issues there were beyond just typing in long and annoying API calls. But it could easily be resurrected; see 7e7a809 and its parent.

TODO

If we do want to switch to using IOCP in general, then the sequence would go something like:

  • check whether zero-byte sends give a way to check TCP writability via IOCP – this is probably the biggest determinant of whether going to IOCP-only is even possible (might be worth checking what doing UDP sends with MSG_PARTIAL does too while we're at it)
  • check whether you really can do zero-byte reads on UDP sockets like libuv claims
  • figure out what kind of UDP send buffering strategy makes sense (or if we decide that UDP sends can just drop packets instead of blocking then I guess the non-blocking APIs remain viable even if we can't do wait_socket_writable on UDP sockets)

At this point we'd have the information to decide whether we can/should go ahead. If so, then the plan would look something like:

  • migrate away from select for the cases that can't use IOCP readable/writable checking: [Not necessary, AFD-based select should work for these too]
    • connect
    • accept
  • implement wait_socket_readable and wait_socket_writable on top of IOCP and get rid of select (but at this point we're still doing non-blocking I/O on sockets, just using IOCP as a select replacement)
  • (optional / someday) switch to using IOCP for everything instead of non-blocking I/O

New plan:

  • Use the tricks from the thread below to reimplement wait_socket_{readable,writable} using AFD, and confirm it works
  • Add LSP testing to our Windows CI
  • Consider whether we want to switch to using IOCP in more cases, e.g. send/recv. Not sure it's worth bothering.
@njsmith

Member

njsmith commented Mar 10, 2017

Something to think about in this process: according to the docs on Windows, with overlapped I/O you can still get WSAEWOULDBLOCK ("too many outstanding overlapped requests"). No-one on the internet seems to have any idea when this actually occurs or why. Twisted has a FIXME b/c they don't handle it, just propagate the error out. Maybe we can do better? Or maybe not.

(I bet if you don't throttle UDP sends at all then that's one way to get this? Though libuv AFAICT doesn't throttle UDP sends at all and I guess they do ok... it'd be worth trying this just to see how Windows handles it.)

@njsmith

Member

njsmith commented Mar 7, 2018

Small note to remember: if we start using GetQueuedCompletionStatusEx directly with timeouts, then we'll need to make sure that we're careful about how we round those timeouts: https://bugs.python.org/issue20311
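As I understand that bug, the problem is timeouts being rounded toward zero when converted to integer milliseconds, which turns a short sleep into a busy loop. A sketch of a conversion that rounds up instead, assuming we pass a millisecond DWORD to GetQueuedCompletionStatusEx; the helper name is hypothetical:

```python
import math

INFINITE = 0xFFFFFFFF  # Windows constant meaning "wait forever"


def timeout_to_milliseconds(timeout_seconds):
    """Convert a timeout in seconds to whole milliseconds, rounding up."""
    if timeout_seconds is None:
        return INFINITE
    if timeout_seconds <= 0:
        return 0
    # Round up so that we never wake before the deadline, and stay
    # below INFINITE so a huge finite timeout is not treated as "forever".
    return min(math.ceil(timeout_seconds * 1000), INFINITE - 1)


for t in (None, 0, 0.0001, 0.25, 1.5):
    print(t, timeout_to_milliseconds(t))
```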

@njsmith

Member

njsmith commented Apr 19, 2018

Hmm, someone on stackoverflow says that they wrote a test program to see if zero-byte IOCP sends could be used to check for writability, and that it didn't work for them :-/

@tmm1

tmm1 commented Sep 18, 2018

That was me, and afaict the IOCP 0-byte trick only works for readability.

According to https://blog.grijjy.com/2018/08/29/creating-high-performance-udp-servers-on-windows-and-linux/ the 0-byte read trick does work with MSG_PEEK (even though the msdn docs say it only works for non-overlapped sockets?).

To initiate a zero-byte read operation for UDP, you simply pass an overlapped IO event with an empty buffer. The key is to include the MSG_PEEK flag in the WsaRecv() overlapped api call. This signals the underlying completion logic to raise an overlapped event, but not to pass any data.

@njsmith

Member

njsmith commented Sep 22, 2018

@tmm1 Hello! 👋 Thank you for that :-).

Looking at this again 18 months later, and with that information in mind... there's still no urgent need to change what we're doing, though if we move forward with some of the wilder ideas in #399 then that might make IOCP more attractive (in particular see #399 (comment)).

How bad would it be to make wait_socket_writable on Windows be unimplemented, or a weird expensive thing that exists but we generally don't use (e.g., it could call select in a thread, like how WaitForSingleObject is implemented)? On Windows, SocketStream.wait_send_all_might_not_block could just return immediately, for a small degradation in UX, or maybe we'll remove wait_send_all_might_not_block anyway (#371). Another use case is wrapping external libraries. ZMQ requires wait_socket_readable, but doesn't need wait_socket_writable. The postgres client library libpq's async mode absolutely assumes that you have wait_socket_writable, unfortunately (e.g.), so that limitation is inherited by psycopg2 and wrappers like @Fuyukai's riopg. Another user is @agronholm's hyperio. Given hyperio's goals and that asyncio also doesn't guarantee any way to wait for readable/writable sockets on Windows, I suspect hyperio would be better off building on top of asyncio's protocols/transports and Trio's Stream layer, which would avoid this issue, but for now that's not what it does.
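For a sense of what the "weird expensive thing" would look like, here is a sketch that parks a blocking select call in a worker thread and hands back a future. The function names are hypothetical and this is not Trio's API. It also ignores cancellation, which a real version would need (e.g. via a wakeup socket in the read set).

```python
import select
import socket
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)


def _block_until_writable(sock: socket.socket, timeout) -> bool:
    # Blocks this worker thread, not the event loop's thread.
    _, writable, _ = select.select([], [sock], [], timeout)
    return bool(writable)


def wait_socket_writable_in_thread(sock: socket.socket, timeout=None) -> Future:
    """Return a future that resolves to True once sock is writable."""
    return _executor.submit(_block_until_writable, sock, timeout)


a, b = socket.socketpair()
future = wait_socket_writable_in_thread(a, timeout=5.0)
is_writable = future.result()  # a fresh socket has an empty send buffer
print(is_writable)
a.close()
b.close()
```

One thread per waiting socket is clearly not something to use on a hot path, but it might be tolerable for the handful of wrapped libraries that insist on a writability check.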

I guess we could dig into the "undocumented sorcery" mentioned above and see how bad it is...

@njsmith

Member

njsmith commented Sep 26, 2018

I guess we could dig into the "undocumented sorcery" mentioned above and see how bad it is...

I got curious about this and did do a bit more digging.

So when you call select on Windows, that's a function in ws2_32.dll. I haven't found the source code for this available anywhere (though I haven't checked whether it comes with visual studio or anything... the CRT source code does, so it's possible winsock does as well).

Normally, running on a standard out-of-the-box install of Windows, select then uses the "AFD" kernel subsystem to actually do what it needs to do. This works by opening a magic filename, and sending requests using the magic DeviceIoControl function (which is essentially the equivalent of Unix ioctl, i.e., "shove some arbitrary requests into the kernel using an idiosyncratic format, I hope you know what you're doing").

That part is actually fine. It's arcane and low-level, but documented well-enough in the source code of projects like libuv and wepoll, and basically sensible. You put together an array of structs listing the sockets you want to wait for and what events you want to wait for, and then submit it to the kernel, and get back a regular IOCP notification when it's ready.
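For orientation, here is a ctypes sketch of what that request looks like. The layout and the two flag values are written from memory of wepoll's definitions of AFD_POLL_INFO and AFD_POLL_HANDLE_INFO, so treat them as an assumption to verify against the wepoll or libuv source; nothing here is documented by Microsoft. The sketch only fills in the buffer for a single socket and does not submit it to the kernel.

```python
import ctypes

# Event flags as I recall them from wepoll; verify before relying on them.
AFD_POLL_RECEIVE = 0x0001
AFD_POLL_SEND = 0x0004


class AfdPollHandleInfo(ctypes.Structure):
    _fields_ = [
        ("Handle", ctypes.c_void_p),  # HANDLE of the base-provider socket
        ("Events", ctypes.c_uint32),  # ULONG mask of events to wait for
        ("Status", ctypes.c_int32),   # NTSTATUS, filled in by the kernel
    ]


class AfdPollInfo(ctypes.Structure):
    _fields_ = [
        ("Timeout", ctypes.c_int64),           # LARGE_INTEGER
        ("NumberOfHandles", ctypes.c_uint32),  # ULONG
        ("Exclusive", ctypes.c_uint32),        # ULONG
        ("Handles", AfdPollHandleInfo * 1),    # one entry in this sketch
    ]


def build_poll_request(base_handle: int, events: int) -> AfdPollInfo:
    """Fill in a one-socket poll request; submitting it is not shown."""
    info = AfdPollInfo()
    info.Timeout = 2**63 - 1  # effectively "no timeout"
    info.NumberOfHandles = 1
    info.Exclusive = 0
    info.Handles[0].Handle = base_handle
    info.Handles[0].Events = events
    info.Handles[0].Status = 0
    return info


request = build_poll_request(1234, AFD_POLL_RECEIVE | AFD_POLL_SEND)
print(request.NumberOfHandles, hex(request.Handles[0].Events))
```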

However! This is not the only thing select can do. There is also something called "layered service providers". The idea here is that ws2_32.dll allows random userspace dlls to interpose themselves on socket calls, and there's a whole elaborate system for how these dlls can intercept calls on the way in, pass them down to the lower level, pass them back out again, etc. The best article online about how this works seems to be Unraveling the Mysteries of Writing a Winsock 2 Layered Service Provider from the MS Systems Journal May 1999, which is no longer available from MS but is still available through sketchy mirror sites (henceforth: UtMoWaW2LSP). Or, there's a book from 2002 called Network Programming for Microsoft Windows, which has a chapter on this, and might be available from your favorite ebook pirate site, not that you have such a thing (henceforth: NPfMW).

Like most unstructured hooking APIs, the designers had high hopes that this would be used for all kinds of fancy things, but in practice it suffers from the deadly flaw of lack of composability:

LSPs have tremendous potential for value-added networking services. But the current Winsock 2 specification does not answer an important question: where to insert an LSP in the protocol chain if there's another one installed [...] the widespread success of an LSP is effectively prohibited because the only safe installation is an LSP over a base provider and making the new chain the default provider for the protocol. Such an approach guarantees the service of the new LSP, but removes the existing LSP as the default provider in the chain. – UtMoWaW2LSP

And in fact, select is particularly broken, because it gets passed multiple sockets at the same time. And with the way the LSP hooking interface works, different sockets might have different hooks installed. So what happens if you pass sockets belonging to two different LSPs into the same select call? How does ws2_32.dll handle this?

Apparently it picks one of the hooks at random and forces it to handle all of the sockets. Which can't possibly work, but nonetheless:

As a consequence of the architecture of Winsock 2 service providers, select in user applications can only correctly handle sockets from the same service provider in a single call. – UtMoWaW2LSP

Even though the Winsock specification explicitly states that only sockets from the same provider may be passed into select, many applications ignore this [well obviously, what else are they supposed to do? -njs] and frequently pass down both TCP and UDP handles together. The base Microsoft providers do not verify that all socket handles are from the same provider. In addition, the Microsoft providers will correctly handle sockets from multiple providers in a single select call. This is a problem with LSPs because an LSP may be layered over just a single entry such as TCP. In this case, the LSP's WSPSelect is invoked with FD_SETs that contain their own sockets plus sockets from other providers (such as a UDP socket from the Microsoft provider). When the LSP is translating the socket handles and comes upon the UDP handle, the context query will fail. At this point, it may return an error (WSAENOTSOCK) or pass the socket down unmodified. If an error is returned, then for the case of an LSP layered only over UDP/IPv4 (or TCP/IPv4), Internet Explorer will no longer function. A workaround is to always install the LSP over all providers for a given address family (such as for IPv4, install over TCP/IPv4, UDP/IPv4, and RAW/IPv4). No Microsoft application or service currently passes socket handles from multiple address families into a single select call, although LSASS on Windows NT 4.0 used to pass IPX and IPv4 sockets together (this has been fixed in the latest service packs for Windows NT 4.0). – NPfMW

Also, the LSP design means that you can inject arbitrary code into any process by modifying the registry keys that configure the LSPs. This tends to break sandboxes, and means that buggy LSPs cause crashes in critical system processes like winlogon.exe.

So this whole design is fundamentally broken, and MS is actively trying to get rid of it: as of Windows 8 / Windows Server 2012, it's officially deprecated, and to help nudge people away from it, Windows 8 "Metro" apps simply ignore any installed LSPs.

Nonetheless, it is apparently used by old firewalls and anti-virus programs and also malware (I guess they must just blindly hook everything, to work around the composability problems?), so some percentage of Windows machines in the wild do still have LSPs installed.

Another fun aspect of LSPs: there are some functions that are not "officially" part of Winsock, but are rather "Microsoft extensions", like ConnectEx (the IOCP version of connect), TransmitFile (the Windows version of sendfile, see #45), etc. Given that MS owns the Winsock standard I have no idea what the distinction here is supposed to be, but I guess it made sense to someone at some point. Anyway, the extension functions have a funny API: there isn't a public function in ws2_32.dll you can just call; instead, you have to first call WSAIoctl with the magic code SIO_GET_EXTENSION_FUNCTION_POINTER, and that returns a pointer to ConnectEx or whatever. How does that interact with LSPs? Well, WSAIoctl actually takes a socket as an argument. So there's no way for LSPs to hook the normal ConnectEx; instead, LSPs are allowed to simply replace the normal ConnectEx on their sockets, by causing WSAIoctl to return a different function pointer entirely.

Twisted and asyncio both get this wrong, btw – they call WSAIoctl once-per-process, and assume that the function pointer that's returned can be used for any socket. But this is wrong – you're actually supposed to re-fetch the function pointers for every socket you want to call them on. Libuv does this correctly. It's interesting that Twisted and asyncio seem to get away with this, though.

You can detect whether an LSP has hooked any particular socket by using getsockopt(socket, SOL_SOCKET, SO_PROTOCOL_INFOW, ...), and then checking protocol_info->ProviderId, which is a GUID labeling which "provider" handles this socket; the built-in MS providers have well-known GUIDs. (libuv checks this: see uv_msafd_provider_ids, uv__fast_poll_get_peer_socket, etc.).
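The comparison step could be sketched like this. The two GUIDs are the "MSAFD Tcpip" IPv4 and IPv6 provider IDs from the netsh catalog listing later in this comment; whether that set matches libuv's list exactly is not something I've checked, and `looks_like_msafd_provider` is a hypothetical name. Fetching the ProviderId via getsockopt is Windows-only and not shown.

```python
import uuid

# Provider IDs of the built-in "MSAFD Tcpip" entries, copied from the
# `netsh winsock show catalog` output in this comment.
MSAFD_PROVIDER_IDS = frozenset({
    uuid.UUID("{E70F1AA0-AB8B-11CF-8CA3-00805F48A192}"),  # TCP/UDP/RAW over IPv4
    uuid.UUID("{F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}"),  # TCP/UDP/RAW over IPv6
})


def looks_like_msafd_provider(provider_id: str) -> bool:
    """Return True if the GUID string names one of the built-in MSAFD providers."""
    return uuid.UUID(provider_id) in MSAFD_PROVIDER_IDS


print(looks_like_msafd_provider("e70f1aa0-ab8b-11cf-8ca3-00805f48a192"))
print(looks_like_msafd_provider("{9D60A9E0-337A-11D0-BD88-0000C082E69A}"))
```

Parsing into `uuid.UUID` first means the check is insensitive to case and to whether the string has braces.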

You can also blatantly violate the whole LSP abstraction layer by passing SIO_BASE_HANDLE to WSAIoctl to fetch the "base service provider" socket, i.e., the one that's at the bottom of the whole LSP stack. In fact, AFAICT libuv actually does this unconditionally, and then uses the "base handle" for all future socket-related calls, including sending and receiving data. So I guess all libuv apps just... ignore LSPs entirely?

Given that libuv seems to get away with this in practice, and Windows 8 Metro apps do something like this, and that Twisted and asyncio are buggy in the presence of LSPs and seem to get away with it in practice, I think we can probably use SIO_BASE_HANDLE and then pretend LSPs don't exist?

But... having used SIO_BASE_HANDLE to get a "base provider" handle, are you guaranteed to be able to use the AFD magic stuff? What is a "base provider" anyway?

NPfMW says:

A base provider exposes a Winsock interface that directly implements a protocol such as the Microsoft TCP/IP provider.

Typically, a base provider has a kernel mode protocol driver associated with it. For example, the Microsoft TCP and UDP providers require the TCP/IP driver TCPIP.SYS to ultimately function. It is also possible to develop your own base providers, but that is beyond the scope of this book. For more information about base providers, consult the Windows Driver Development Kit (DDK).

On a Windows system, you can get a list of "providers" (both LSPs and base providers) by running netsh winsock show catalog. Here's the output on a Win 10 VM (from modern.ie):

C:\Users\IEUser
λ netsh winsock show catalog

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        Hyper-V RAW
Provider ID:                        {1234191B-4BF7-4CA7-86E0-DFD7C32B5445}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1001
Version:                            2
Address Family:                     34
Max Address Length:                 36
Min Address Length:                 36
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20026
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [TCP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1006
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [UDP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1007
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [RAW/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1008
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [TCP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1009
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [UDP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1010
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [RAW/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1011
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Irda [IrDA]
Provider ID:                        {3972523D-2AF1-11D1-B655-00805F3642CC}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1012
Version:                            2
Address Family:                     26
Max Address Length:                 32
Min Address Length:                 8
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20006
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP TCPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1017
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP TCP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1018
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP UDPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1019
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP UDP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1020
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        Hyper-V RAW
Provider ID:                        {1234191B-4BF7-4CA7-86E0-DFD7C32B5445}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1001
Version:                            2
Address Family:                     34
Max Address Length:                 36
Min Address Length:                 36
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20026
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [TCP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1006
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [UDP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1007
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [RAW/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1008
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [TCP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1009
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x20066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [UDP/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1010
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Tcpip [RAW/IPv6]
Provider ID:                        {F9EAB0C0-26D4-11D0-BBBF-00AA006C34E4}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1011
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        3
Protocol:                           0
Service Flags:                      0x20609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        MSAFD Irda [IrDA]
Provider ID:                        {3972523D-2AF1-11D1-B655-00805F3642CC}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1012
Version:                            2
Address Family:                     26
Max Address Length:                 32
Min Address Length:                 8
Socket Type:                        1
Protocol:                           1
Service Flags:                      0x20006
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP TCPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1017
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP TCP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1018
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        1
Protocol:                           6
Service Flags:                      0x22066
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP UDPv6 Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1019
Version:                            2
Address Family:                     23
Max Address Length:                 28
Min Address Length:                 28
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider (32)
Description:                        RSVP UDP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Catalog Entry ID:                   1020
Version:                            2
Address Family:                     2
Max Address Length:                 16
Min Address Length:                 16
Socket Type:                        2
Protocol:                           17
Service Flags:                      0x22609
Protocol Chain Length:              1

(This only has the service providers; the command also lists "name space providers", which is a different extension mechanism, but I left those out.)

So if we group by entries with the same GUID, there's one base provider that does IPv4 (TCP, UDP, and RAW), another that does IPv6 (TCP, UDP, and RAW), one that handles Hyper-V sockets, one that handles IrDA sockets, and one that handles "RSVP" (whatever that is).
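For anyone who wants to redo that grouping on their own machine, here's a rough sketch. It assumes the dump has the same layout as the listing above (I believe that's what `netsh winsock show catalog` prints, but treat that as an assumption); `parse_catalog` and `group_by_provider` are just names I made up, and `SAMPLE` is a trimmed-down excerpt rather than a full catalog.

```python
from collections import defaultdict

# Trimmed excerpt in the same layout as the dump above (not a full catalog).
SAMPLE = r"""
Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [TCP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        MSAFD Tcpip [UDP/IP]
Provider ID:                        {E70F1AA0-AB8B-11CF-8CA3-00805F48A192}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Protocol Chain Length:              1

Winsock Catalog Provider Entry
------------------------------------------------------
Entry Type:                         Base Service Provider
Description:                        RSVP TCP Service Provider
Provider ID:                        {9D60A9E0-337A-11D0-BD88-0000C082E69A}
Provider Path:                      %SystemRoot%\system32\mswsock.dll
Protocol Chain Length:              1
"""


def parse_catalog(text):
    """Split the dump into one dict of 'Key: value' fields per catalog entry."""
    entries = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "Winsock Catalog Provider Entry":
            current = {}
            entries.append(current)
        elif current is not None and ":" in line:
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return entries


def group_by_provider(entries):
    """Map each Provider ID (GUID string) to the set of descriptions using it."""
    groups = defaultdict(set)
    for entry in entries:
        groups[entry["Provider ID"]].add(entry["Description"])
    return dict(groups)


groups = group_by_provider(parse_catalog(SAMPLE))
for guid, descriptions in groups.items():
    print(guid, sorted(descriptions))
```

Feeding it the full dump instead of `SAMPLE` should reproduce the five groups described above.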

The three special GUIDs that libuv thinks can work with the AFD magic seem to correspond to the IPv4 and IPv6 providers, and then the third one, from some googling, appears to be MS's standard Bluetooth provider. (I guess this VM doesn't have the Bluetooth drivers installed.)

So: I think libuv on Windows always ends up using the AFD magic when it's working with IPv4 or IPv6 or Bluetooth sockets. When it gets something else – meaning IrDA, Hyper-V, RSVP, or something even more exotic involving a third-party driver – it falls back on calling select in a thread, similar to how we implement WaitForSingleObject.

But AFAIK, on Windows, Python's socket module only supports IPv4 and IPv6 anyway. And I actually kinda suspect that IrDA and Hyper-V use AFD too, since they're official Microsoft-maintained providers.

Final conclusion

I think we can use SIO_BASE_HANDLE + the freaky AFD magic to implement wait_socket_writable on top of IOCP, with no fallback needed.

@piscisaureus


piscisaureus commented Sep 27, 2018

@njsmith

I randomly stumbled upon this message. Since this stuff is so arcane that it's actually kinda fun, some extra pointers:

So when you call select on Windows, that's a function in ws2_32.dll. I haven't found the source code for this available anywhere (though I haven't checked whether it comes with visual studio or anything... the CRT source code does, so it's possible winsock does as well).

cough

https://github.com/pustladi/Windows-2000/blob/661d000d50637ed6fab2329d30e31775046588a9/private/net/sockets/winsock2/wsp/msafd/select.c#L59-L655

https://github.com/metoo10987/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/afd/poll.c#L68-L707
(Note that this source code is very old. I have reasons to believe some things have changed significantly.)

The three special GUIDs that libuv thinks can work with the AFD magic seem to correspond to the IPv4 and IPv6 providers, and then the third one, from some googling, appears to be MS's standard Bluetooth provider. (I guess this VM doesn't have the Bluetooth drivers installed.)

This is actually kind of a bug in libuv -- this logic makes libuv use its "slow" mode too eagerly (libuv has a fallback mechanism where it simply calls select() in a thread). If a socket is bound to an IFS LSP, it'll have a different GUID that's not one of those 3 on the whitelist; however, the socket would still work with IOCTL_AFD_POLL. Since all base service providers are built into Windows and they all use AFD, it's enough to simply check whether the protocol chain length (in the WSAPROTOCOL_INFOW structure) equals 1.
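To make the chain-length check concrete, here's a small ctypes sketch. The structure layout follows my reading of the Winsock headers (fixed-width types so the layout is the same off Windows too); `is_base_provider_socket` and `protocol_info_from_bytes` are hypothetical helper names, and the `getsockopt` call in the comment is an untested assumption about how you'd fetch the structure from Python.

```python
import ctypes

MAX_PROTOCOL_CHAIN = 7
WSAPROTOCOL_LEN = 255
SO_PROTOCOL_INFOW = 0x2005  # getsockopt option that returns WSAPROTOCOL_INFOW


class GUID(ctypes.Structure):
    _fields_ = [
        ("Data1", ctypes.c_uint32),
        ("Data2", ctypes.c_uint16),
        ("Data3", ctypes.c_uint16),
        ("Data4", ctypes.c_uint8 * 8),
    ]


class WSAPROTOCOLCHAIN(ctypes.Structure):
    _fields_ = [
        ("ChainLen", ctypes.c_int),
        ("ChainEntries", ctypes.c_uint32 * MAX_PROTOCOL_CHAIN),
    ]


class WSAPROTOCOL_INFOW(ctypes.Structure):
    _fields_ = [
        ("dwServiceFlags1", ctypes.c_uint32),
        ("dwServiceFlags2", ctypes.c_uint32),
        ("dwServiceFlags3", ctypes.c_uint32),
        ("dwServiceFlags4", ctypes.c_uint32),
        ("dwProviderFlags", ctypes.c_uint32),
        ("ProviderId", GUID),
        ("dwCatalogEntryId", ctypes.c_uint32),
        ("ProtocolChain", WSAPROTOCOLCHAIN),
        ("iVersion", ctypes.c_int),
        ("iAddressFamily", ctypes.c_int),
        ("iMaxSockAddr", ctypes.c_int),
        ("iMinSockAddr", ctypes.c_int),
        ("iSocketType", ctypes.c_int),
        ("iProtocol", ctypes.c_int),
        ("iProtocolMaxOffset", ctypes.c_int),
        ("iNetworkByteOrder", ctypes.c_int),
        ("iSecurityScheme", ctypes.c_int),
        ("dwMessageSize", ctypes.c_uint32),
        ("dwProviderReserved", ctypes.c_uint32),
        # WCHAR is 16 bits on Windows; c_uint16 keeps the size portable.
        ("szProtocol", ctypes.c_uint16 * (WSAPROTOCOL_LEN + 1)),
    ]


def protocol_info_from_bytes(raw):
    """Interpret a raw getsockopt result as a WSAPROTOCOL_INFOW."""
    return WSAPROTOCOL_INFOW.from_buffer_copy(raw)


def is_base_provider_socket(info):
    """ChainLen == 1 marks a base provider; 0 is a layered entry, >1 a chain."""
    return info.ProtocolChain.ChainLen == 1


# On Windows, something like this should produce `raw` (untested assumption):
#   raw = sock.getsockopt(socket.SOL_SOCKET, SO_PROTOCOL_INFOW,
#                         ctypes.sizeof(WSAPROTOCOL_INFOW))
```

The point is just that the check is a single integer comparison, with no GUID whitelist involved.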

I think we can use SIO_BASE_HANDLE + the freaky AFD magic to implement wait_socket_writable on top of IOCP, with no fallback needed.

Note that windows itself actually does this most of the time: since windows vista, select() cannot be intercepted by LSPs unless the LSP does all of the following:

  • It's a non-IFS LSP.
  • It hooks the SIO_BSP_HANDLE_SELECT ioctl.
  • It installs itself over all protocols.

I have yet to encounter the first LSP that actually does this.
See https://blogs.msdn.microsoft.com/wndp/2006/07/13/the-new-trouble-with-select-and-lsps/

If you wanted to be super pedantic about it, you could call both SIO_BASE_HANDLE and SIO_BSP_HANDLE_SELECT and check if they are the same.
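A sketch of that pedantic check, in Python. The two ioctl codes are built from the `_WSAIOR(IOC_WS2, n)` macro as I understand it; `ioctl_handle` is a hypothetical callable standing in for a real `WSAIoctl` wrapper (it takes a socket handle and an ioctl code and returns the resulting handle as an int), so the logic can be exercised without Windows.

```python
IOC_OUT = 0x40000000
IOC_WS2 = 0x08000000


def _wsaior(code):
    """Python rendering of the _WSAIOR(IOC_WS2, code) macro."""
    return IOC_OUT | IOC_WS2 | code


SIO_BSP_HANDLE_SELECT = _wsaior(28)  # 0x4800001C
SIO_BASE_HANDLE = _wsaior(34)        # 0x48000022


def select_goes_to_base_provider(sock_handle, ioctl_handle):
    """Return True if select() on this socket lands on the base provider.

    `ioctl_handle(sock_handle, code)` is a hypothetical WSAIoctl wrapper.
    If both ioctls report the same handle, no LSP is intercepting select().
    """
    base = ioctl_handle(sock_handle, SIO_BASE_HANDLE)
    select_target = ioctl_handle(sock_handle, SIO_BSP_HANDLE_SELECT)
    return base == select_target


# Fake ioctl for illustration: socket 100 is unlayered, socket 200 sits
# under a made-up LSP that hooks SIO_BSP_HANDLE_SELECT.
_FAKE_RESULTS = {
    (100, SIO_BASE_HANDLE): 100,
    (100, SIO_BSP_HANDLE_SELECT): 100,
    (200, SIO_BASE_HANDLE): 150,
    (200, SIO_BSP_HANDLE_SELECT): 200,
}


def fake_ioctl(sock_handle, code):
    return _FAKE_RESULTS[(sock_handle, code)]


print(select_goes_to_base_provider(100, fake_ioctl))
print(select_goes_to_base_provider(200, fake_ioctl))
```

If the two handles differ, that's the rare case where an LSP really is hooking select(), and polling the base handle directly would bypass it.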

So if we group by entries with the same GUID, there's one base provider that does IPv4 (TCP, UDP, and RAW), another that does IPv6 (TCP, UDP, and RAW), one that handles Hyper-V sockets, one that handles IrDA sockets, and one that handles "RSVP" (whatever that is).

There's really no need to do any of this grouping in practice, they all use the same code path in the AFD driver. I removed it from wepoll in piscisaureus/wepoll@e7e8385.

Also note that RSVP is deprecated and no longer supported, and that since the latest Windows 10 update there is AF_UNIX, which also works with IOCTL_AFD_POLL just fine.

@njsmith


Member

njsmith commented Sep 28, 2018

@piscisaureus Oh hey, thanks for dropping by! I was wondering why wepoll didn't seem to check GUIDs, and was completely baffled by the docs for SIO_BSP_HANDLE_SELECT... that clarifies a lot!

What do you think of unconditionally calling SIO_BASE_HANDLE and using that instead of the original socket for everything, the way libuv seems to? I guess this will probably have the effect of circumventing any LSPs, but... will anyone care? It seems like it might be simpler than messing with SIO_BSP_HANDLE_SELECT...

And do you happen to know what happens if you try to use the AFD poll stuff with a handle that isn't a MSAFD base provider handle? Does it give a sensible error, or...?

@piscisaureus


piscisaureus commented Sep 30, 2018

What do you think of unconditionally calling SIO_BASE_HANDLE and using that instead of the original socket for everything, the way libuv seems to?

Libuv doesn't do that. It uses SIO_BASE_HANDLE to determine whether it can use some optimizations, but it doesn't attempt to bypass LSPs.

If you want to bypass LSPs outright, then you should do it when you create the socket. This is done by using WSASocketW() instead of socket() and passing it the appropriate WSAPROTOCOL_INFOW structure associated with the base service provider.

When you create a non-IFS-LSP-bound socket and then get and use the base handle everywhere, it would likely cause memory leaks and other weirdness (in particular if you also use the base handle when closing the socket with closesocket()).

I guess this will probably have the effect of circumventing any LSPs, but... will anyone care?

Not too many people will care, probably. The only LSP I've seen people use intentionally is Proxifier. That's an IFS LSP, so there's no need to bypass it; the socket is a valid AFD handle.

And do you happen to know what happens if you try to use the AFD poll stuff with a handle that isn't a MSAFD base provider handle? Does it give a sensible error, or...?

You'll get a STATUS_INVALID_HANDLE error.
https://github.com/metoo10987/WinNT4/blob/f5c14e6b42c8f45c20fe88d14c61f9d6e0386b8e/private/ntos/afd/poll.c#L155

@njsmith


Member

njsmith commented Oct 1, 2018

Libuv doesn't do that. It uses SIO_BASE_HANDLE to determine whether it can use some optimizations, but it doesn't attempt to bypass LSPs.

Oh, you're right. I was looking at uv_poll_init, which does use SIO_BASE_HANDLE for everything, but uv_poll_t is only used for checking readability/writability – normally libuv networking is through the uv_tcp_t and uv_udp_t paths, which don't care about this stuff because they just use IOCP send/receive.

It looks like the only place uv_tcp_t uses SIO_BASE_HANDLE is when calling CancelIo. Odd, I wonder why they do that. Hmm, and it looks like it was added in libuv/libuv@6e8eb33 by... @piscisaureus :-). Do you remember why you did it that way? ...oh ugh, is it because LSPs can't override CancelIo, so it just fails on any non-IFS handle?

If you want to bypass LSPs outright, then you should do it when you create the socket. This is done by using WSASocketW() instead of socket() and passing it the appropriate WSAPROTOCOL_INFOW structure associated with the base service provider.

When you create a non-IFS-LSP-bound socket and then get and use the base handle everywhere, it would likely cause memory leaks and other weirdness (in particular if you also use the base handle when closing the socket with closesocket()).

I see, excellent points, thank you. And on further thought, it probably wouldn't have worked anyway, since for the sockets we control we can use IOCP send; the sockets where we have to poll for writability are the ones from some 3rd-party library like libpq where we can't control what they do anyway.

I guess the main reason this seemed attractive is that it would let us have a single code path, without treating LSP sockets differently. I'm kind of terrified of that, because I have no idea how to test the LSP code paths. Do you by chance happen to know any reliable way to set up LSPs for testing? ("OK, so to make sure we're testing our code under realistic conditions, we're going to install malware on all our Windows test systems...")

Though, it's also probably not a huge deal, because it sounds like we can have a single straight-line code path in all cases anyway, where we do the SIO_BASE_HANDLE dance just when checking for writability, and otherwise use whatever random socket the OS gives us.

Thanks for all the information, by the way, this is super helpful.

@piscisaureus


piscisaureus commented Oct 2, 2018

It looks like the only place uv_tcp_t uses SIO_BASE_HANDLE is when calling CancelIo. Do you remember why you did it that way? ...oh ugh, is it because LSPs can't override CancelIo, so it just fails on any non-IFS handle?

Indeed, CancelIo/CancelIoEx doesn't work with non-IFS sockets. IIRC libuv also works around issues with SetFileCompletionNotificationModes not working in the presence of non-IFS LSPs. Come to think of it, I'm kind of surprised that CreateIoCompletionPort does work; it seems unlikely that LSPs can trap that?

I see, excellent points, thank you. And on further thought, it probably wouldn't have worked anyway, since for the sockets we control we can use IOCP send; the sockets where we have to poll for writability are the ones from some 3rd-party library like libpq where we can't control what they do anyway.

This was also the main reason we added uv_poll -- to integrate with c-ares (dns library).
Although I think if I were to do it again today, I'd have used it also for uv_tcp/uv_udp so the different backends could share a lot more code.

I'm kind of terrified of that, because I have no idea how to test the LSP code paths. Do you by chance happen to know any reliable way to set up LSPs for testing? ("OK, so to make sure we're testing our code under realistic conditions, we're going to install malware on all our Windows test systems...")

https://github.com/piscisaureus/wepoll/blob/0857889a0fccdba62654e11ae743dd96d85cc711/.appveyor.yml#L73-L95

Proxifier, as mentioned earlier, is an IFS LSP I have seen in the wild.
The PC Tools thing contains a non-IFS LSP. The product itself is obsolete. I think it actually breaks the internet on appveyor, but I can still test on the loopback interface.

IIRC older versions of the Windows SDK also contained a reference implementation for different LSP types. You could compile and install it. I've never gotten around to doing that.

@njsmith


Member

njsmith commented Oct 2, 2018

Indeed, CancelIo/CancelIoEx doesn't work with non-IFS sockets. IIRC libuv also works around issues with SetFileCompletionNotificationModes not working in the presence of non-IFS LSPs. Come to think of it, I'm kind of surprised that CreateIoCompletionPort does work; it seems unlikely that LSPs can trap that?

Looking again at Network Programming for Microsoft Windows, 2nd ed., I think the way LSPs handle IO completion ports is that when you call CreateIoCompletionPort, ws2_32.dll somehow gets notified and internally stores an association between the LSP socket and the relevant completion port. Then when an LSP intercepts a call to WSASend or whatever, it's supposed to notice that there's a non-NULL OVERLAPPED, and then when the operation completes it calls another magic ws2_32.dll function called WPUCompleteOverlappedRequest, which looks up the appropriate IO completion port / event object / etc. and handles the notification delivery. So... that's fine. But CancelIo didn't exist yet when this was written, and I can't see any way they could have retrofitted it in besides the SIO_BASE_HANDLE hack that libuv's already using.

That said, I have no idea why this doesn't work with FILE_SKIP_COMPLETION_PORT_ON_SUCCESS. It's true that there's no reasonable way for WPUCompleteOverlappedRequest to know whether the operation succeeded immediately (and in fact the sample LSP in that book always fails all overlapped operations with WSA_IO_PENDING because it makes the code simpler). So they can't skip notifications just when they're redundant – either they have to skip all notifications, even when they're necessary, or they have to deliver all notifications, even when they're redundant. And apparently they chose the first option? Bizarre.

This was also the main reason we added uv_poll -- to integrate with c-ares (dns library). Although I think if I were to do it again today, I'd have used it also for uv_tcp/uv_udp so the different backends could share a lot more code.

That is definitely what we're going to do in Trio, at least to start :-).

https://github.com/piscisaureus/wepoll/blob/0857889a0fccdba62654e11ae743dd96d85cc711/.appveyor.yml#L73-L95

Oh beautiful, we are totally going to steal that wholesale. Thank you!


New plan

So it sounds like at this point we now know everything we need to reimplement select on top of IOCP. This means we can dramatically simplify trio/_core/_io_windows.py (we can get rid of the background thread!), while keeping its external API exactly the same (in particular, trio/_socket.py won't need to change at all), and probably making it faster too. This is great! We should do this.

@njsmith


Member

njsmith commented Oct 3, 2018

This little article has some useful information on the poorly-documented FILE_SKIP_SET_EVENT_ON_HANDLE (apparently you just always want this in modern software).

@njsmith njsmith added the todo soon label Nov 20, 2018
