
Adds WSARecvFrom as a Windows fallback #882

Closed
wants to merge 7 commits into from

Conversation


@lewishazell lewishazell commented May 24, 2023

As described in #589, performance on Windows is very slow. In my experience, this affects reading more than sending. The issue was partially addressed in #410 by increasing buffer sizes, although that didn't seem to do much on my machine.

Looking at the Linux-specific socket code, I saw a number of optimizations: it uses system calls directly instead of going through the standard library, and it batches reads with SYS_RECVMMSG. So, this pull request uses the closest Windows equivalent, WSARecvFrom, to allow batching. It also respects the buffer sizes from the listen.read_buffer and listen.write_buffer configuration values.
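For anyone unfamiliar with overlapped I/O, here is a minimal single-packet sketch of the idea (illustrative only, not the PR's actual code; assumes golang.org/x/sys/windows and is Windows-only). The real change posts a batch of these reads and waits on their events with WSAWaitForMultipleEvents:

```go
//go:build windows

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/windows"
)

// recvOne posts one overlapped WSARecvFrom and blocks until it completes.
// A batched reader keeps many of these outstanding at once.
func recvOne(sock windows.Handle) (int, error) {
	buf := make([]byte, 65535)
	wsabuf := windows.WSABuf{Len: uint32(len(buf)), Buf: &buf[0]}

	var from windows.RawSockaddrAny
	fromLen := int32(unsafe.Sizeof(from))

	// One event per outstanding read; a batch holds an array of these.
	event, err := windows.CreateEvent(nil, 1, 0, nil)
	if err != nil {
		return 0, err
	}
	defer windows.CloseHandle(event)
	ol := windows.Overlapped{HEvent: event}

	var n, flags uint32
	err = windows.WSARecvFrom(sock, &wsabuf, 1, &n, &flags, &from, &fromLen, &ol, nil)
	if err == windows.ERROR_IO_PENDING {
		// Wait for completion; use the WSA variant of the result call.
		err = windows.WSAGetOverlappedResult(sock, &ol, &n, true, &flags)
	}
	if err != nil {
		return 0, err
	}
	return int(n), nil
}

func main() { fmt.Println("recvOne sketch; requires a bound Windows UDP socket") }
```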

My home connection is 50-60Mb/s and I am now able to consistently run iperf3 at full speed with these modifications:

[  5] local 10.84.45.122 port 24719 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.53 MBytes  54.7 Mbits/sec
[  5]   1.00-2.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   2.00-3.01   sec  6.85 MBytes  56.8 Mbits/sec
[  5]   3.01-4.00   sec  6.90 MBytes  58.5 Mbits/sec
[  5]   4.00-5.00   sec  6.89 MBytes  57.9 Mbits/sec
[  5]   5.00-6.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   6.00-7.00   sec  6.89 MBytes  57.8 Mbits/sec
[  5]   7.00-8.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   8.00-9.00   sec  6.80 MBytes  57.0 Mbits/sec
[  5]   9.00-10.00  sec  6.87 MBytes  57.6 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec  70.0 MBytes  58.5 Mbits/sec   38             sender
[  5]   0.00-10.00  sec  68.4 MBytes  57.4 Mbits/sec                  receiver

iperf Done.

Versus almost 4Mb/s with the recently released v1.7.1:

[  5] local 10.84.45.122 port 50474 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec   826 KBytes  6.73 Mbits/sec
[  5]   1.01-2.00   sec   838 KBytes  6.88 Mbits/sec
[  5]   2.00-3.00   sec   599 KBytes  4.92 Mbits/sec
[  5]   3.00-4.00   sec   237 KBytes  1.94 Mbits/sec
[  5]   4.00-5.00   sec   237 KBytes  1.95 Mbits/sec
[  5]   5.00-6.00   sec   732 KBytes  6.01 Mbits/sec
[  5]   6.00-7.01   sec   242 KBytes  1.97 Mbits/sec
[  5]   7.01-8.01   sec   235 KBytes  1.93 Mbits/sec
[  5]   8.01-9.00   sec   235 KBytes  1.94 Mbits/sec
[  5]   9.00-10.00  sec   234 KBytes  1.91 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec  4.52 MBytes  3.78 Mbits/sec   42             sender
[  5]   0.00-10.00  sec  4.31 MBytes  3.62 Mbits/sec                  receiver

iperf Done.

It may be worth testing this implementation with a faster connection but these results are very promising.

@salesforce-cla

Thanks for the contribution! Before we can merge this, we need @lewishazell to sign the Salesforce Inc. Contributor License Agreement.

@lewishazell
Author

I've now tested this implementation on a faster link between two servers and it's capable of receiving over 200Mb/s.

Reverse mode, remote host 10.84.45.89 is sending
[  5] local 10.84.45.101 port 63360 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  25.2 MBytes   211 Mbits/sec
[  5]   1.00-2.00   sec  24.5 MBytes   206 Mbits/sec
[  5]   2.00-3.00   sec  27.9 MBytes   234 Mbits/sec
[  5]   3.00-4.00   sec  26.2 MBytes   220 Mbits/sec
[  5]   4.00-5.00   sec  26.3 MBytes   220 Mbits/sec
[  5]   5.00-6.00   sec  26.1 MBytes   219 Mbits/sec
[  5]   6.00-7.00   sec  25.2 MBytes   212 Mbits/sec
[  5]   7.00-8.00   sec  27.2 MBytes   228 Mbits/sec
[  5]   8.00-9.00   sec  26.9 MBytes   226 Mbits/sec
[  5]   9.00-10.00  sec  27.1 MBytes   227 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.04  sec   266 MBytes   222 Mbits/sec  192             sender
[  5]   0.00-10.00  sec   263 MBytes   220 Mbits/sec                  receiver

This is slightly slower than the actual link speed of 250Mb/s, but I think the bottleneck is the CPU on the low-powered Linux host.

@nbrownus
Collaborator

This does look promising!

I have been spending some time lately looking at Windows performance again. I have an old branch, which I am preparing a pull request for, that does registered I/O (RIO) on Windows. The trouble with those APIs is that they were only introduced around Windows 8.1/Server 2012 R2.

This is where your PR comes in handy. Overlapped I/O has been in Windows since long before Windows 7, and its performance is generally close to RIO's. It is a fantastic fallback over the current Go stdlib implementation and captures the remaining Windows market that RIO misses.
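The selection order I have in mind could be sketched like this (illustrative Go with hypothetical names, not Nebula's actual listener API; each constructor reports whether its API set exists on this Windows version):

```go
package main

import (
	"errors"
	"fmt"
)

// udpConn stands in for the project's UDP listener interface (hypothetical).
type udpConn interface{ Name() string }

type rioConn struct{}
type wsaConn struct{}
type stdConn struct{}

func (rioConn) Name() string { return "rio" }
func (wsaConn) Name() string { return "wsa" }
func (stdConn) Name() string { return "std" }

var errUnsupported = errors.New("API not available on this Windows version")

// Pretend RIO is missing (e.g. pre-8.1) so the chain falls through to WSA.
func newRIO() (udpConn, error) { return nil, errUnsupported }
func newWSA() (udpConn, error) { return wsaConn{}, nil }
func newStd() (udpConn, error) { return stdConn{}, nil }

// newListener tries the fastest implementation first and falls back.
func newListener() udpConn {
	for _, try := range []func() (udpConn, error){newRIO, newWSA, newStd} {
		if c, err := try(); err == nil {
			return c
		}
	}
	panic("unreachable: the stdlib listener never fails here")
}

func main() {
	fmt.Println(newListener().Name()) // wsa
}
```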

I'm far from a Windows expert, though, and while testing this PR I noticed a couple of issues:

  • WSARecvFrom occasionally returns WSAECONNRESET under load. I did not have time to determine why, though it did not seem like something we needed to log. It would be nice to understand why it happens.
  • GetOverlappedResult occasionally returns "No service is operating at the destination network endpoint on the remote system." and I ran out of time trying to understand why.
  • I noticed a high rate of TCP retransmissions, and my TCP window size couldn't stabilize as a result.
  • In my limited research I was confused by the use of GetOverlappedResult instead of WSAGetOverlappedResult (there were a few others as well).

With all that said, it would be fantastic to land this PR. Any chance you can join the Nebula OSS Slack group?

@lewishazell
Author

Thanks for checking it out @nbrownus! You definitely caught some things I didn't see.

I had a feeling you would mention backwards compatibility, and I agree this would make a good fallback here. I noticed there is a udp-interface branch; would that be required for runtime fallback functionality?

I'm also not a Windows (or golang) expert and just trying to help fix an issue we face! That also explains some of the oddities you see.

There isn't any particular reason I chose GetOverlappedResult over WSAGetOverlappedResult, except that my IDE couldn't find the WSA variant, so I assumed it was a naming quirk in the Go windows library. That could be wrong and may be the reason you see issues under load. I'll try using the WSA* functions where possible, which is probably correct, to see if it helps before digging too deep into other possibilities.

I'll join the slack group and try to find some time 😄

@lewishazell
Author

lewishazell commented Jun 28, 2023

Hey @nbrownus, I looked into the issues.

I was able to reproduce the high retransmission rate by setting up a test network between a Windows host and WSL. As it was, I was seeing up to 1,500 retransmissions per second. Fixing the code to use the WSA* variants of the functions (and checking the correct buffer they return) improved stability somewhat: retransmission rates stabilized and never exceeded 1,000 per second, yet the overall reduction was negligible.

The real improvement came when I tweaked the buffer sizes in the config file; Windows' default buffer size is much too small here. For me, setting listen.read_buffer and listen.write_buffer to 655360 dramatically reduced retransmissions. Instead of almost 10,000 retransmissions while transferring 2 gigabytes, I now get between 200 and 600 on most runs at ~650Mbit/s. Does this look acceptable?
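Concretely, that corresponds to the following in the config file (these are the values that worked for me; tune them for your own environment):

```yaml
listen:
  # Windows' defaults are much smaller; 655360 bytes (640 KiB) worked well here
  read_buffer: 655360
  write_buffer: 655360
```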

I also haven't seen the WSAECONNRESET or "No service is operating at the destination network endpoint on the remote system." errors yet. I'll keep running load tests.

Please see the pushed changes 👍

@lewishazell
Author

lewishazell commented Jun 29, 2023

I left a load test running for a few hours and saw an error when attempting to write:

time="2023-06-29T10:40:43+01:00" level=error msg="Failed to write outgoing packet" certName=linux error="WSAWaitForMultipleEvents: winapi error #6" localIndex=768892931 remoteIndex=690320809 udpAddr="<nil>" vpnIp=10.0.1.1
runtime.preemptM: duplicatehandle failed; errno=1450
fatal error: runtime.preemptM: duplicatehandle failed

I'm not sure creating and closing handles so frequently is a great idea, and it only made the code more complicated. I've reimplemented WriteTo using Sendto and I haven't seen that error since.

@robdplatt

@lewishazell Do you have a fork/build that I can test on Server 2012R2? I'm testing a nightly that includes #905 but I'm still stuck around 80 MPS according to netcps. Without Nebula, I'm hitting almost 300 MPS.


Thanks for the contribution! Unfortunately we can't verify the commit author(s): Lewis Hazell <l***@p***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, sign the Salesforce Inc. Contributor License Agreement and this Pull Request will be revalidated.

This initial implementation only supports IPv4

- Uses WSA methods as an equivalent to SYS_RECVMMSG
- Respects configuration values for buffers and batch sizes
- Refactor ReadMulti to read the buffer at the index given by WSAWaitForMultipleEvents
- WSASendTo method eventually gave "WSAWaitForMultipleEvents: winapi error slackhq#6" under load
- Simplified code with similar performance
@lewishazell
Author

Hey @robdplatt, my branch was a bit behind so I've merged in changes. I'll just test it out and get a build ready.

@lewishazell
Author

lewishazell commented Nov 12, 2023

@robdplatt I've created a release on my fork including the WSA fallback.

From my testing on 2012R2 this evening, I'm not sure my changes will help. When RIO isn't supported, the following will appear in the logs:

Falling back to standard udp sockets

I didn't see this message, and I saw speeds of almost 300Mbps (over WiFi; I don't have a great test rig to hand), which suggests RIO is supported on 2012 R2. My changes will only help if you see the above message.

@lewishazell lewishazell changed the title Optimized UDP for Windows Adds WSARecvFrom as a Windows fallback Nov 12, 2023
@robdplatt

> @robdplatt I've created a release on my fork including the WSA fallback.
>
> From my testing on 2012R2 this evening, I'm not sure my changes will help. When RIO isn't supported, the following will appear in the logs:
>
> Falling back to standard udp sockets
>
> I didn't see this and saw speeds of almost 300Mbps (over WiFi, don't have a great test rig to hand) which suggests RIO is supported in 2012R2. My changes will only help if you see the above message.

Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.

Did you happen to specify the buffer sizes in your config?

@lewishazell
Author

> Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.
>
> Did you happen to specify the buffer sizes in your config?

No problem!

That's strange. As far as I know, RIO (which Windows 11 will use) doesn't use the buffer size config, so if you're not seeing any fallbacks, buffer size won't help.

For WSA, I used buffer sizes of 655360 as it worked well in my environment but YMMV.

You could also try adjusting the MTU if you haven't already. That should apply to all three methods so might be your best bet here.

@robdplatt

>> Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.
>>
>> Did you happen to specify the buffer sizes in your config?
>
> No problem!
>
> That's strange. As far as I know, RIO (which Windows 11 will use) doesn't use the buffer size config so, if you're not seeing any fallbacks, buffer size won't help.
>
> For WSA, I used buffer sizes of 655360 as it worked well in my environment but YMMV.
>
> You could also try adjusting the MTU if you haven't already. That should apply to all three methods so might be your best bet here.

Changing the MTU settings seems to make it worse. Your build and the nightly are giving me the same results. Win11 <--> Win11 on the same switch is getting me about 50MB/s. Without Nebula, I'm getting 100MB/s. Win11 <--> Server2012 R2 (again same switch) drops to about half that. ~25MB/s. This is, however, a lot better than the 5MB/s I was getting last year or so. For these tests I disabled the lighthouse and relays to ensure the clients were talking directly to each other. I don't see any strange errors in the logs either.

@johnmaguire
Collaborator

Thanks for the work here. I want to apologize that this never made it into a release. However, the next release of Nebula, v1.9.0, will drop support for any version of Windows older than Windows 10 / Server 2016, since Go 1.21 has also dropped support for those versions.

Nebula already has support for RIO, which should improve performance on Windows 8+. Therefore, merging this PR would not improve the situation.
