
Connection issues with NetKVM + Windows Server 2022 combination #583

Closed
olljanat opened this issue May 20, 2021 · 24 comments
olljanat commented May 20, 2021

It looks like something has changed in Windows Server 2022 (preview build 20344) compared to Windows Server 2019 (version 1809), because the NetKVM driver does not work correctly on it.

The issues I have seen are:

  • RDP connections cannot be used because they get stuck after a while (TightVNC works)
  • Some DNS requests fail, and it seems to happen especially with short names, so I guess the problem is at least with small UDP packets.

If I switch to the emulated Intel e1000 NIC, those issues disappear.

It is also worth mentioning that I run this test VM on the Nutanix AHV platform, so if others are not able to reproduce the issue, it might be platform specific (even though it looks like a driver + Windows compatibility issue to me). I tested with both the latest version of the Nutanix-packaged driver and 0.1.190 from here, and both have the same issue.

PS. I also reported this to the Windows Server Insiders forum: https://techcommunity.microsoft.com/t5/windows-server-insiders/connection-issues-with-netkvm-windows-server-2022-combination/m-p/2371881

@sb-ntnx
Contributor

sb-ntnx commented May 20, 2021

Hello Olli,

Thank you for reporting this issue. I have had a quick look and indeed can reproduce the same behavior with Windows Server 2022 (preview build 20344), even though the previous preview build does not have this issue.

I found that disabling Offload.Tx.Checksum mitigates the issue. Therefore, may I ask you to verify the same? You can disable Tx checksum offloading via Device Manager -> Network Adapters -> open the properties of "Nutanix VirtIO Ethernet Adapter" -> find the "Offload.Tx.Checksum" parameter in the list on the 'Advanced' tab, and then change "Value" to "Disabled" or "TCP(v4)".

BR,
Sergey

@olljanat
Author

@sb-ntnx Yes, that looks to be a valid workaround. Thanks, I will use that in my lab, then.

@leidwang

I tried to reproduce it on my host, but failed.

qemu version: qemu-kvm-6.0.0-16.module+el8.5.0+10848+2dccc46d.x86_64

The qemu command line:
-device virtio-net-pci,mac=9a:30:66:51:5b:46,id=idCkSbbZ,mq=on,netdev=idoK0011,bus=pcie-root-port-3,addr=0x0 \
-netdev tap,id=idoK0011,vhost=on,queues=6

Related files:
Windows_InsiderPreview_Server_vNext_en-us_20344.iso
virtio-win-prewhql-0.1-199.iso


@YanVugenfirer
Collaborator

@sb-ntnx Hi Sergey

Based on the previous comment, do you think that the issue is specific to AHV?

Thanks,
Yan.

@sb-ntnx
Contributor

sb-ntnx commented May 25, 2021

Hi @YanVugenfirer,

Unfortunately, I've not had time to look into the issue yet. I hope to have a further look later this week.

BR,
Sergey

@jborean93

jborean93 commented Jun 16, 2021

Just thought I would share that I am encountering this issue with the Server 2022 image using the virtio network adapter in VirtualBox. Interestingly enough QEMU isn't affected. Using the workaround mentioned in #583 (comment) by settings Offload.Tx.Checksum to Disabled does mitigate the issue in VirtualBox for me. A way to set this through PowerShell is:

Get-NetAdapter |
    Where-Object DriverFileName -eq 'netkvm.sys' |
    Set-NetAdapterAdvancedProperty -RegistryKeyword 'Offload.TxChecksum' -RegistryValue 0

@sb-ntnx
Contributor

sb-ntnx commented Jun 17, 2021

Sorry for the delay. It has been fairly busy recently, so I've not had time yet to thoroughly investigate the issue. However, I tested with qemu-6.0.0 (with AHV) and got the same behavior.

BR,
Sergey

@YanVugenfirer
Collaborator

@sb-ntnx What's the host networking configuration?
Also, are you using vhost?

According to @leidwang, on QEMU/KVM with vhost and a Linux bridge there is no problem.

@sb-ntnx
Contributor

sb-ntnx commented Jun 18, 2021

The qemu command line is:

-netdev tap,fd=29,id=hostnet0,vhost=on,vhostfd=31
-device virtio-net-pci,rx_queue_size=256,netdev=hostnet0,id=net0,mac=50:6b:8d:c0:ca:18,bus=pci.0,addr=0x3,bootindex=3

By default, tap interfaces are added to an OVS bond, but I tried with a plain Linux bridge between two VMs running on the same host:

[root@svc1-3 ~]# ip link set HCK up
[root@svc1-3 ~]# virsh domiflist 017d4bf9-5a49-4bdf-ad39-7d108a0cffc9
Interface  Type       Source     Model       MAC
-------------------------------------------------------
tap1       ethernet   -          virtio      50:6b:8d:9b:59:e6

[root@svc1-3 ~]# virsh domiflist 57b53584-6eb7-46c3-bc0c-a235184c88e7
Interface  Type       Source     Model       MAC
-------------------------------------------------------
tap0       ethernet   -          virtio      50:6b:8d:c0:ca:18
tap2       ethernet   -          e1000       50:6b:8d:e3:5f:f3

[root@svc1-3 ~]# ovs-vsctl del-port tap0
[root@svc1-3 ~]# ovs-vsctl del-port tap1
[root@svc1-3 ~]# brctl addif HCK tap0
[root@svc1-3 ~]# brctl addif HCK tap1

However, what I noticed is that DNS resolution may still work, so you actually need to examine the packets on the receiving server to check whether the checksum is correct:
Screenshot 1: DNS request coming when running ipconfig /flushdns && ping -n 1 t.co

Screenshot 2: DNS request, when trying to resolve t.co using nslookup.exe:

In the lab test with the Linux bridge, I noticed that for DNS lookups initiated by ping.exe -n 1 t.co, if we compare what leaves the client VM with what arrives at the DHCP server, the arriving packet is bigger, and I don't know how to explain it:

If the lookup is done via nslookup.exe, then we see the same packet size on the server and client:

@YanVugenfirer
Collaborator

@sb-ntnx Can you please specify what the Linux kernel version was?

@YanVugenfirer
Collaborator

@jborean93 What was the host OS when you ran VirtualBox?

@jborean93

> @jborean93 What was the host OS when you ran VirtualBox?

It was Fedora 34; the current kernel is 5.12.12-300.fc34.x86_64, but it may have been slightly different at the time due to a dnf update in the recent past.

@sb-ntnx
Contributor

sb-ntnx commented Jul 3, 2021

@YanVugenfirer: The latest AHV kernel that I tested was based on 5.4.109.

@leidwang

leidwang commented Jul 7, 2021

Tested this issue on Win2022; still did not hit the problem.
Host:
kernel-4.18.0-321.el8.x86_64
qemu-kvm-6.0.0-22.module+el8.5.0+11695+95588379.x86_64

@leidwang

leidwang commented Jul 8, 2021

Still did not hit this issue with kernel-5.13.
Env:
kernel-5.13.0-0.rc7.51.el9.x86_64
qemu-kvm-6.0.0-7.el9.x86_64
Guest(Win2022):
Windows_InsiderPreview_Server_vNext_en-us_20344.iso
virtio-win-prewhql-0.1-202.iso

@sb-ntnx
Contributor

sb-ntnx commented Jul 8, 2021

@leidwang, thank you for the tests.
@YanVugenfirer: could you point me to the piece of code where the checksums are calculated?

@YanVugenfirer
Collaborator

YanVugenfirer commented Jul 8, 2021

@sb-ntnx
Some explanation regarding checksum offload: in general we offload checksum calculation to the host. But when TCP segmentation offload (LSO) is enabled and IP checksum offload is turned on, we cannot offload both (virtio-net-hdr only has room for one set of checksum metadata), so we calculate the IP checksum in the driver.
So you can follow the "trail":

bool CNB::BindToDescriptor(CTXDescriptor &Descriptor) in ParaNdis_TX.cpp
-> void CNB::PrepareOffloads(virtio_net_hdr *VirtioHeader, PVOID IpHeader, ULONG EthPayloadLength, ULONG L4HeaderOffset) const
-> SetupLSO or DoIPHdrCSO
-> then the code in sw_offload.cpp (via ParaNdis_CheckSumVerifyFlat): tTcpIpPacketParsingResult ParaNdis_CheckSumVerify(tCompletePhysicalAddress *pDataPages, ULONG ulDataLength, ULONG ulStartOffset, ULONG flags, BOOLEAN verifyLength, LPCSTR caller)
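[Editor's sketch] To make that concrete, here is a minimal, hedged illustration (not the actual ParaNdis/sw_offload.cpp code; the function name is made up) of the standard software IPv4 header checksum the driver must compute itself once the single csum_start/csum_offset slot in virtio_net_hdr is already occupied by the TCP checksum for LSO:

// Illustrative sketch only, not NetKVM source: one's-complement IPv4 header
// checksum computed in software when the single checksum slot in
// virtio_net_hdr is already used for the TCP checksum during LSO.
#include <cstddef>
#include <cstdint>

static uint16_t IPv4HeaderChecksum(const uint8_t *hdr, size_t lenBytes /* IHL * 4 */)
{
    uint32_t sum = 0;

    // Sum the header as big-endian 16-bit words; the checksum field itself
    // (bytes 10-11) is assumed to be zeroed by the caller.
    for (size_t i = 0; i + 1 < lenBytes; i += 2)
    {
        sum += (static_cast<uint32_t>(hdr[i]) << 8) | hdr[i + 1];
    }

    // Fold the carries back into the low 16 bits and return the complement.
    while (sum >> 16)
    {
        sum = (sum & 0xFFFF) + (sum >> 16);
    }
    return static_cast<uint16_t>(~sum);
}

The real code in sw_offload.cpp also has to parse the headers out of the packet's buffers first, but the checksum arithmetic itself is this standard one.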

YanVugenfirer pushed a commit that referenced this issue Jul 22, 2021
…ET_BUFFER

can be greater than size of NET_BUFFER itself. As the result, the size of the
packet can be calculated incorrectly.

Instead of using sum of MDLs lengths, we can cap the max length with the size
of NET_BUFFER.

Bug: #583
Signed-off-by: Sergey Bykov <sergey.bykov@nutanix.com>
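[Editor's sketch] For context, a rough conceptual sketch of the idea described in the commit message above (not the merged patch itself; the helper name is made up for illustration): the MDL chain backing a NET_BUFFER can describe more memory than the packet actually occupies, so a length obtained by summing MDL byte counts has to be capped at NET_BUFFER_DATA_LENGTH.

// Conceptual sketch, not the merged patch: never let the summed MDL lengths
// exceed the NET_BUFFER's own data length, so trailing bytes of the last MDL
// are not treated as packet payload.
#include <ndis.h>

static ULONG QueryPacketLength(PNET_BUFFER NetBuffer)
{
    ULONG mdlTotal = 0;

    // Sum the byte counts of the MDL chain that backs this NET_BUFFER.
    for (PMDL mdl = NET_BUFFER_CURRENT_MDL(NetBuffer); mdl != NULL; mdl = mdl->Next)
    {
        mdlTotal += MmGetMdlByteCount(mdl);
    }

    // NET_BUFFER_DATA_LENGTH is the authoritative packet size; cap at it.
    ULONG dataLength = NET_BUFFER_DATA_LENGTH(NetBuffer);
    return (mdlTotal < dataLength) ? mdlTotal : dataLength;
}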
@sb-ntnx
Contributor

sb-ntnx commented Jul 27, 2021

@YanVugenfirer: I guess we can close this bug out, as the issue is understood and the PR has been merged upstream?

@olljanat: We will ship this change as part of the Nutanix VirtIO 1.1.7 suite. We cannot commit to a release date, but we will try to expedite it.

@YanVugenfirer
Collaborator

> @YanVugenfirer: I guess we can close this bug out, as the issue is understood and the PR has been merged upstream?

Yes, I will close it. Thanks a lot for the PR.

@nchevsky

FYI, @sb-ntnx's fix is included in Fedora's 0.1.208-1 release of the virtio-win drivers. 🎉

@zx2c4

zx2c4 commented Oct 19, 2021

This caused a crash in SmartOS: TritonDataCenter/illumos-kvm-cmd#25

Moreover, it affects all Windows OSes, not just Win2022, as this condition is hit when using WSK, which is used by various system components and drivers such as WireGuardNT.

The impact isn't just "networking breaks"; it also appears to leak uninitialized kernel memory onto the network, which could contain secrets. Therefore, you might want to promote this to a security release and push it out via Windows Update.

@YanVugenfirer
Collaborator

@zx2c4 Thanks.

@wioxjk

wioxjk commented Dec 29, 2021

Can confirm that this is an issue in Nutanix AHV with Windows Server 2022.
I have opened up a ticket with them regarding this.

Disabling Offload.Tx.Checksum did mitigate the issue.

@sb-ntnx
Contributor

sb-ntnx commented Dec 29, 2021

@wioxjk As mentioned in #583 (comment), the fix was delivered in Nutanix VirtIO 1.1.7.
