
Adds WSARecvFrom as a Windows fallback #882

Closed
wants to merge 7 commits into from

Conversation


@lewishazell lewishazell commented May 24, 2023

As described in #589, performance on Windows is very slow. In my experience, this affects reading more than sending. The issue was partially addressed in #410 by increasing buffer sizes, although that didn't seem to do much on my machine.

Looking at the Linux-specific socket code, I saw a number of optimizations: it uses system calls directly instead of going through the standard library, and it batches reads with SYS_RECVMMSG. So, this pull request uses the closest Windows equivalent, WSARecvFrom, to allow batching. It also respects the buffer sizes from the listen.read_buffer and listen.write_buffer configuration values.
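For anyone unfamiliar with overlapped I/O, here is a minimal single-packet sketch of the idea (illustrative only, not the PR's actual code; assumes golang.org/x/sys/windows and is Windows-only). The real change posts a batch of these reads and waits on their events with WSAWaitForMultipleEvents:

```go
//go:build windows

package main

import (
	"fmt"
	"unsafe"

	"golang.org/x/sys/windows"
)

// recvOne posts one overlapped WSARecvFrom and blocks until it completes.
// A batched reader keeps many of these outstanding at once.
func recvOne(sock windows.Handle) (int, error) {
	buf := make([]byte, 65535)
	wsabuf := windows.WSABuf{Len: uint32(len(buf)), Buf: &buf[0]}

	var from windows.RawSockaddrAny
	fromLen := int32(unsafe.Sizeof(from))

	// One event per outstanding read; a batch holds an array of these.
	event, err := windows.CreateEvent(nil, 1, 0, nil)
	if err != nil {
		return 0, err
	}
	defer windows.CloseHandle(event)
	ol := windows.Overlapped{HEvent: event}

	var n, flags uint32
	err = windows.WSARecvFrom(sock, &wsabuf, 1, &n, &flags, &from, &fromLen, &ol, nil)
	if err == windows.ERROR_IO_PENDING {
		// Wait for completion; use the WSA variant of the result call.
		err = windows.WSAGetOverlappedResult(sock, &ol, &n, true, &flags)
	}
	if err != nil {
		return 0, err
	}
	return int(n), nil
}

func main() { fmt.Println("recvOne sketch; requires a bound Windows UDP socket") }
```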

My home connection is 50-60Mb/s and I am now able to consistently run iperf3 at full speed with these modifications:

[  5] local 10.84.45.122 port 24719 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  6.53 MBytes  54.7 Mbits/sec
[  5]   1.00-2.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   2.00-3.01   sec  6.85 MBytes  56.8 Mbits/sec
[  5]   3.01-4.00   sec  6.90 MBytes  58.5 Mbits/sec
[  5]   4.00-5.00   sec  6.89 MBytes  57.9 Mbits/sec
[  5]   5.00-6.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   6.00-7.00   sec  6.89 MBytes  57.8 Mbits/sec
[  5]   7.00-8.00   sec  6.90 MBytes  57.9 Mbits/sec
[  5]   8.00-9.00   sec  6.80 MBytes  57.0 Mbits/sec
[  5]   9.00-10.00  sec  6.87 MBytes  57.6 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec  70.0 MBytes  58.5 Mbits/sec   38             sender
[  5]   0.00-10.00  sec  68.4 MBytes  57.4 Mbits/sec                  receiver

iperf Done.

Versus almost 4Mb/s with the recently released v1.7.1:

[  5] local 10.84.45.122 port 50474 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.01   sec   826 KBytes  6.73 Mbits/sec
[  5]   1.01-2.00   sec   838 KBytes  6.88 Mbits/sec
[  5]   2.00-3.00   sec   599 KBytes  4.92 Mbits/sec
[  5]   3.00-4.00   sec   237 KBytes  1.94 Mbits/sec
[  5]   4.00-5.00   sec   237 KBytes  1.95 Mbits/sec
[  5]   5.00-6.00   sec   732 KBytes  6.01 Mbits/sec
[  5]   6.00-7.01   sec   242 KBytes  1.97 Mbits/sec
[  5]   7.01-8.01   sec   235 KBytes  1.93 Mbits/sec
[  5]   8.01-9.00   sec   235 KBytes  1.94 Mbits/sec
[  5]   9.00-10.00  sec   234 KBytes  1.91 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.05  sec  4.52 MBytes  3.78 Mbits/sec   42             sender
[  5]   0.00-10.00  sec  4.31 MBytes  3.62 Mbits/sec                  receiver

iperf Done.

It may be worth testing this implementation with a faster connection but these results are very promising.

@salesforce-cla

Thanks for the contribution! Before we can merge this, we need @lewishazell to sign the Salesforce Inc. Contributor License Agreement.

@lewishazell
Author

I've now tested this implementation on a faster link between two servers and it's capable of receiving over 200Mb/s.

Reverse mode, remote host 10.84.45.89 is sending
[  5] local 10.84.45.101 port 63360 connected to 10.84.45.89 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  25.2 MBytes   211 Mbits/sec
[  5]   1.00-2.00   sec  24.5 MBytes   206 Mbits/sec
[  5]   2.00-3.00   sec  27.9 MBytes   234 Mbits/sec
[  5]   3.00-4.00   sec  26.2 MBytes   220 Mbits/sec
[  5]   4.00-5.00   sec  26.3 MBytes   220 Mbits/sec
[  5]   5.00-6.00   sec  26.1 MBytes   219 Mbits/sec
[  5]   6.00-7.00   sec  25.2 MBytes   212 Mbits/sec
[  5]   7.00-8.00   sec  27.2 MBytes   228 Mbits/sec
[  5]   8.00-9.00   sec  26.9 MBytes   226 Mbits/sec
[  5]   9.00-10.00  sec  27.1 MBytes   227 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.04  sec   266 MBytes   222 Mbits/sec  192             sender
[  5]   0.00-10.00  sec   263 MBytes   220 Mbits/sec                  receiver

This is slightly slower than the actual link speed of 250Mb/s, but I think the bottleneck is the CPU on the low-powered Linux host.

@nbrownus
Collaborator

This does look promising!

I have been spending some time lately looking at Windows performance again. I have an old branch, which I am preparing a pull request for, that does registered I/O (RIO) on Windows. The trouble with those APIs is that they were only introduced around Windows 8.1/Server 2012 R2.

This is where your PR comes in handy. Overlapped I/O has been in Windows since long before Windows 7, and its performance is generally close to RIO's. It is a fantastic fallback over the current Go stdlib implementation and captures the remaining Windows market that RIO misses.
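The selection order I have in mind could be sketched like this (illustrative Go with hypothetical names, not Nebula's actual listener API; each constructor reports whether its API set exists on this Windows version):

```go
package main

import (
	"errors"
	"fmt"
)

// udpConn stands in for the project's UDP listener interface (hypothetical).
type udpConn interface{ Name() string }

type rioConn struct{}
type wsaConn struct{}
type stdConn struct{}

func (rioConn) Name() string { return "rio" }
func (wsaConn) Name() string { return "wsa" }
func (stdConn) Name() string { return "std" }

var errUnsupported = errors.New("API not available on this Windows version")

// Pretend RIO is missing (e.g. pre-8.1) so the chain falls through to WSA.
func newRIO() (udpConn, error) { return nil, errUnsupported }
func newWSA() (udpConn, error) { return wsaConn{}, nil }
func newStd() (udpConn, error) { return stdConn{}, nil }

// newListener tries the fastest implementation first and falls back.
func newListener() udpConn {
	for _, try := range []func() (udpConn, error){newRIO, newWSA, newStd} {
		if c, err := try(); err == nil {
			return c
		}
	}
	panic("unreachable: the stdlib listener never fails here")
}

func main() {
	fmt.Println(newListener().Name()) // wsa
}
```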

I'm far from a Windows expert, though, and while testing this PR I noticed a couple of issues:

  • WSARecvFrom occasionally returns WSAECONNRESET under load. I did not have time to determine why, though it did not seem like something we needed to log. It would be nice to understand why it happens.
  • GetOverlappedResult occasionally returns "No service is operating at the destination network endpoint on the remote system." and I ran out of time trying to understand why.
  • I noticed a high rate of TCP retransmissions, and my TCP window size couldn't stabilize as a result.
  • In my limited research I was confused by the use of GetOverlappedResult instead of WSAGetOverlappedResult (there were a few others as well).

With all that said, it would be fantastic to land this PR. Any chance you can join the Nebula OSS Slack group?

@lewishazell
Author

Thanks for checking it out @nbrownus! You definitely caught some things I didn't see.

I had a feeling you would mention backwards compatibility, and I agree this would make a good fallback here. I noticed there is a udp-interface branch; would that be required for runtime fallback functionality?

I'm also not a Windows (or golang) expert and just trying to help fix an issue we face! That also explains some of the oddities you see.

There isn't any particular reason I chose GetOverlappedResult over WSAGetOverlappedResult, except that my IDE couldn't find the WSA variant, so I assumed it was a naming quirk in the Go windows library. That could be wrong and may be the reason you see issues under load. I'll try using the WSA* functions where possible, which is probably correct, to see if it helps before digging too deep into other possibilities.

I'll join the slack group and try to find some time 😄

@lewishazell
Author

lewishazell commented Jun 28, 2023

Hey @nbrownus, I looked into the issues.

I was able to reproduce the high retransmission rate by setting up a test network between a Windows host and WSL. As it was, I was seeing up to 1,500 retransmissions per second. Fixing the code to use the WSA* variants of the functions (and checking the correct buffer they return) improved stability somewhat: retransmission rates stabilized and never exceeded 1,000 per second, yet the overall reduction was negligible.

The real improvement came when I tweaked the buffer sizes in the config file; Windows' default buffer size is much too small here. For me, setting listen.read_buffer and listen.write_buffer to 655360 dramatically reduced retransmissions. Instead of almost 10,000 retransmissions while transferring 2 gigabytes, I now get between 200 and 600 on most runs at ~650Mbit/s. Does this look acceptable?
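Concretely, that corresponds to the following in the config file (these are the values that worked for me; tune them for your own environment):

```yaml
listen:
  # Windows' defaults are much smaller; 655360 bytes (640 KiB) worked well here
  read_buffer: 655360
  write_buffer: 655360
```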

I also haven't seen the WSAECONNRESET or "No service is operating at the destination network endpoint on the remote system." errors yet. I'll keep running load tests.

Please see the pushed changes 👍

@lewishazell
Author

lewishazell commented Jun 29, 2023

I left a load test running for a few hours and saw an error when attempting to write:

time="2023-06-29T10:40:43+01:00" level=error msg="Failed to write outgoing packet" certName=linux error="WSAWaitForMultipleEvents: winapi error #6" localIndex=768892931 remoteIndex=690320809 udpAddr="<nil>" vpnIp=10.0.1.1
runtime.preemptM: duplicatehandle failed; errno=1450
fatal error: runtime.preemptM: duplicatehandle failed

I'm not sure creating and closing handles so frequently is a great idea, and it only made the code more complicated. I've reimplemented WriteTo using Sendto and I haven't seen that error since.

@robdplatt

@lewishazell Do you have a fork/build that I can test on Server 2012R2? I'm testing a nightly that includes #905 but I'm still stuck around 80 MPS according to netcps. Without Nebula, I'm hitting almost 300 MPS.


Thanks for the contribution! Unfortunately we can't verify the commit author(s): Lewis Hazell <l***@p***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, sign the Salesforce Inc. Contributor License Agreement and this Pull Request will be revalidated.

This initial implementation only supports IPv4

- Uses WSA methods as an equivalent to SYS_RECVMMSG
- Respects configuration values for buffers and batch sizes
- Refactor ReadMulti to read the buffer at the index given by WSAWaitForMultipleEvents
- WSASendTo method eventually gave "WSAWaitForMultipleEvents: winapi error slackhq#6" under load
- Simplified code with similar performance
@lewishazell
Author

Hey @robdplatt, my branch was a bit behind so I've merged in changes. I'll just test it out and get a build ready.

@lewishazell
Author

lewishazell commented Nov 12, 2023

@robdplatt I've created a release on my fork including the WSA fallback.

From my testing on 2012R2 this evening, I'm not sure my changes will help. When RIO isn't supported, the following will appear in the logs:

Falling back to standard udp sockets

I didn't see this message, and I saw speeds of almost 300Mbps (over WiFi; I don't have a great test rig to hand), which suggests RIO is supported on 2012 R2. My changes will only help if you see the above message.

@lewishazell lewishazell changed the title Optimized UDP for Windows Adds WSARecvFrom as a Windows fallback Nov 12, 2023
@robdplatt

> @robdplatt I've created a release on my fork including the WSA fallback.
>
> From my testing on 2012R2 this evening, I'm not sure my changes will help. When RIO isn't supported, the following will appear in the logs:
>
> Falling back to standard udp sockets
>
> I didn't see this and saw speeds of almost 300Mbps (over WiFi, don't have a great test rig to hand) which suggests RIO is supported in 2012R2. My changes will only help if you see the above message.

Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.

Did you happen to specify the buffer sizes in your config?

@lewishazell
Author

> Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.
>
> Did you happen to specify the buffer sizes in your config?

No problem!

That's strange. As far as I know, RIO (which Windows 11 will use) doesn't use the buffer size config, so if you're not seeing any fallbacks, buffer size won't help.

For WSA, I used buffer sizes of 655360 as it worked well in my environment but YMMV.

You could also try adjusting the MTU if you haven't already. That should apply to all three methods so might be your best bet here.

@robdplatt

>> Thank you for sharing these changes. However, I'm not seeing those speeds yet, even between two Windows 11 instances. I also didn't see "Falling back to standard udp sockets" in the logs when testing between two Server 2012 R2 instances. Maybe I'm missing something.
>>
>> Did you happen to specify the buffer sizes in your config?
>
> No problem!
>
> That's strange. As far as I know, RIO (which Windows 11 will use) doesn't use the buffer size config so, if you're not seeing any fallbacks, buffer size won't help.
>
> For WSA, I used buffer sizes of 655360 as it worked well in my environment but YMMV.
>
> You could also try adjusting the MTU if you haven't already. That should apply to all three methods so might be your best bet here.

Changing the MTU settings seems to make it worse. Your build and the nightly are giving me the same results. Win11 <--> Win11 on the same switch is getting me about 50MB/s. Without Nebula, I'm getting 100MB/s. Win11 <--> Server2012 R2 (again same switch) drops to about half that. ~25MB/s. This is, however, a lot better than the 5MB/s I was getting last year or so. For these tests I disabled the lighthouse and relays to ensure the clients were talking directly to each other. I don't see any strange errors in the logs either.

@johnmaguire
Collaborator

Thanks for the work here. I want to apologize that this never made it into a release. However, the next release of Nebula, v1.9.0, will drop support for any version of Windows older than Windows 10 / Server 2016, since Go 1.21 has also dropped support for those versions.

Nebula already has support for RIO, which should improve performance on Windows 8+. Therefore, merging this PR would not improve the situation.
