UDP6 checksum problem #1832
Comments
If it's a function of a setting of the onboard NIC, then have you tried using a different USB-ethernet adapter? There are multiple chipsets available on the open market from different manufacturers. This would tell you for definite if it's a function of hardware. |
Do you believe checksums are correct except the case where the checksum is 0x0000 which should be replaced by 0xffff? Or are you getting many spurious 0x0000 checksums? |
@P33M if he fixed it using ethtool, it's either the hardware, the firmware for the NIC (I don't recall whether the chip the Pis use has loadable firmware), or the driver. In any of the three cases, if he gets an adapter with a different chipset, I can almost guarantee it will work (the "almost" referring to the possibility that he happens to find a different chipset that has exactly the same behavior). |
I suspect the issue lies in here: I'm not sure if it needs a check for 0x0000 and replacement with 0xffff. |
Possibly. I hadn't noticed the ethtool command used before, but that implies heavily that it's hardware or firmware, because that particular ethtool command disables checksum offloading. @sconemad Do things work when you only disable offloading for transmit, or do you have to disable both to get it to work? |
Changing the smsc driver there will only fix packets of length < 45, as they use a SW checksum; otherwise, if I read the code correctly, I presume the HW inserts the checksum on transmit, after this function. Might be worth hacking it to always use SW checksum to see if it fixes the problem. |
I changed the |
The driver seems to default to having TX checksum offloading enabled, which would suggest that the @popcornmix any chance you could try that modified kernel and then run |
I'm just testing the replacement of 0 with 0xffff and it is running longer. |
@JamesH65 If however the hardware is still checksumming the packets that would match that section for some reason, then there's no point in having the section as-is, since the hardware would be stomping on the software checksum anyway. Looking at the code again, if that is the case, then it's another (possibly related, possibly unrelated) hardware bug. I could of course be completely off-base on this, most of my expertise regarding kernel code is very high-level stuff, not drivers... |
@popcornmix Yes, I think the problem is only with packets where the checksum is calculated to be 0, which should be replaced with 0xFFFF. |
@Ferroin This patch fixes the problem with hardware checksums enabled (the default set up), but obviously wastes cpu always calculating crcs. Unfortunately I can't find the checksum in the not With the patch present, running |
BTW: I have emailed the author of this code and he has pinged microchip so perhaps we will get some advice from them. |
@Ferroin Yes, I only need to disable transmit offloading to make it work. I guess this means that the receive offloading works, since the returned datagram is going to have the same checksum in my test code, since only the source and destination addresses are flipped. Thanks for looking into this everyone. Please let me know if you'd like me to test anything. |
@sconemad: Is this for personal use? I went to Microchip's website and they have a newer driver but you have to agree to some legal terms to download which I think includes non-disclosure. I did download it and briefly looked at it. If you aren't planning to distribute the source code or this is for a closed source device, you can probably just use Microchip's driver. Or at least test it and see if it fixes the issue. Otherwise, you will probably just have to wait for the original author to fix it. The product appears to be now called LAN9514. http://www.microchip.com/wwwproducts/en/LAN9514 |
My understanding of the code is that the line 1934 if (csum) Adds a flag to the packet to enable the HW checksum to be added later. Which is why you cannot see it in this function, and cannot correct it in this function. I have no idea how to get round that - it would appear that this is a firmware issue and will need to be fixed there. |
I should clarify; if we force SW checksums all the time we can fix it here (as per Dom's change), but if we want to use HW checksums, I suspect that needs a firmware change. If no firmware change, it's a trade off between added CPU burden, vs losing a packet and requiring a retry on average every 65k. Might be a sweet spot dependent on packet size. ie over a particular size, use HW anyway and take the chance of a bad checksum, because the CPU burden is too high the rest of the time. |
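As a rough worked example of the "one in 65k" figure, here is a small sketch in C. It assumes that computed checksums are spread uniformly over 65536 values and that datagrams are independent of each other; both are simplifications. The function name `p_any_bad` is hypothetical and is not part of the driver or of the test program.

```c
#include <assert.h>

/*
 * p_any_bad (hypothetical helper): probability that at least one of n
 * datagrams is affected, when each one is independently affected with
 * probability p_bad. Assumption: p_bad = 1.0 / 65536 models "the computed
 * checksum happens to be 0".
 */
double p_any_bad(unsigned long n, double p_bad)
{
    assert(p_bad >= 0.0 && p_bad <= 1.0);

    double p_none = 1.0;                 /* probability that no datagram is hit */
    for (unsigned long i = 0; i < n; i++)
        p_none *= 1.0 - p_bad;

    return 1.0 - p_none;
}
```

Under these assumptions, `p_any_bad(100000, 1.0 / 65536)` comes out at roughly 0.78, so a run of 100,000 datagrams is more likely than not to hit the problem at least once, though it can still pass cleanly.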
The problem for our application (and the reason we noticed this bug), is
that retrying failed packets won't work! When we resend the dropped packet,
it's sent unchanged, and indeed, without corrupting the data we're trying
to send it's not possible to change the data on retransmission. So, every
retransmission also fails due to the checksum. Random packet loss with UDP
is bearable, but systematic packet loss is crippling.
|
(sorry, pressed the wrong button!) I was basically going to say what @NWilson said above (he is the one who originally discovered this issue BTW). This was found in testing RealVNC Server (included in Raspbian), so we're quite keen to see a general solution for this, although it's nice to know that there are ways to work around it. |
If someone could do some testing to get numbers both on how much forcing SW checksumming would impact network throughput and how much CPU load it would impose, that would be good. |
@Ferroin - That kind of test may actually be a bunch of work to write. Lots of different configurations to test, etc. UDP performance is hard to test anyway. At this point, maybe it is best to wait and see what the driver author and Microchip say about this. I know Microchip is generally very open about hardware and firmware issues. They are also constantly tweaking the devices, so this issue may in fact only happen on certain revisions, or it may happen only in very specific cases which are easy to detect and correct in the driver. If it is indeed a firmware issue instead of a silicon issue, it may even be possible to update the firmware. |
I'm not sure how helpful this is, but I was able to do some fairly simple tests by modifying my test program to send and receive 100,000 x 1000 byte datagrams, and measuring the run time, which came out to be around 1 minute. (Luckily I didn't get any packets with a 0 checksum.) Running it quite a few times with checksum offloading on and off, I wasn't able to detect any significant difference in any of the timings. |
Do you have a Pi1 to try it on? The impact on a single core ARM of this era
may be significant, and we need to ensure things still work OK on all Pi
models.
--
James Hughes
Principal Software Engineer,
Raspberry Pi (Trading) Ltd
|
This is on a Pi1 at 700MHz
So with HW checksums on we get: So with SW checksums on we get: So I'm not measuring a difference with this test. |
I have just discovered a related issue that IPv6 UDP checksum calculation is failing for very small packets (see #1944). Switching off hardware offloading using ethtool fixes the problem. Looks like the LAN9514 is a bit dodgy for IPv6... |
From other issue: |
Any news on a fix from Microchip? |
No news. I've pinged them - I'll report here when there is any news. |
Update: "The work is ongoing. Should see the patch next week." |
kernel: irq_bcm2836: Send event when onlining sleeping cores
kernel: ARM: dts: bcm283x: Reserve first page for firmware
See: raspberrypi/linux#1989
kernel: smsc95xx: Avoid HW TX CSUM for IPV6
See: raspberrypi/linux#1832
Patch from microchip has been submitted to netdev mailing list. |
Tested with kernel 4.9.27-v7+ #997 SMP Tue May 9 19:58:37 BST 2017 obtained using rpi-update. Both my udpcs test app and RealVNC server no longer encounter this problem, so this appears to have fixed it. Thanks! |
I've found a problem in the way that the UDP checksum is calculated for outgoing UDP packets over IPv6 (I'm using the latest up-to-date Raspbian on a Pi 3, if that makes any difference).
I'm testing an application on the Pi which uses UDP transport over IPv6, sending datagrams to another machine. I've noticed that occasionally a datagram is sent from the Pi which is never received by the application on the other machine, even when repeatedly re-transmitted. Using Wireshark, I was able to see the UDP packets arriving on the other machine, and noticed that the "missing" packets had a UDP checksum of 0.
According to RFC 2460, checksums are not optional for UDP packets over IPv6 like they are for IPv4, so the value 0 is invalid (in the case where the checksum is calculated to be 0, it's supposed to be replaced with 0xFFFF). I believe this is why the other machine is discarding these packets. Since the checksum is a 16 bit number, this would happen on average for 1 in every 65536 packets.
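For reference, the rule above can be sketched in C roughly as follows. This is a minimal illustration, not the code from udpcs.c or from the smsc95xx driver, and the names `sum_words` and `udp6_checksum` are hypothetical. It sums the IPv6 pseudo-header (source address, destination address, upper-layer length, next header value 17) together with the UDP header and payload in one's-complement arithmetic, then replaces a computed 0x0000 with 0xFFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running sum, reading them as big-endian 16-bit words. */
static uint32_t sum_words(uint32_t sum, const uint8_t *data, size_t len)
{
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)                        /* odd trailing byte: pad with zero */
        sum += (uint32_t)data[i] << 8;
    return sum;
}

/*
 * udp6_checksum (hypothetical helper).
 * src, dst: 16-byte IPv6 addresses.
 * udp:      UDP header plus payload, with the checksum field set to zero.
 * udp_len:  length of udp in bytes (8..65535).
 */
uint16_t udp6_checksum(const uint8_t src[16], const uint8_t dst[16],
                       const uint8_t *udp, size_t udp_len)
{
    uint32_t sum = 0;
    uint16_t csum;

    assert(udp_len >= 8 && udp_len <= 0xFFFF);

    /* Pseudo-header: addresses, upper-layer length, next header (17 = UDP). */
    sum = sum_words(sum, src, 16);
    sum = sum_words(sum, dst, 16);
    sum += (uint32_t)udp_len;
    sum += 17;

    /* UDP header and payload. */
    sum = sum_words(sum, udp, udp_len);

    /* Fold the carries back in, then take the one's complement. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    csum = (uint16_t)~sum;

    /* Over IPv6 a zero checksum is not allowed: send 0xFFFF instead. */
    return csum == 0 ? 0xFFFF : csum;
}
```

Because the result depends only on the addresses and the datagram contents, resending an unchanged datagram always produces the same checksum.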
I've written a small test program in C to demonstrate this issue (attached as udpcs.c.txt).
You'll need a Pi with Raspbian, and another machine (e.g. Linux x86), connected over Ethernet with IPv6 supported.
Copy the test program onto both machines and build with:
gcc -o udpcs udpcs.c
On the other (non-Pi) machine, run:
./udpcs
This will start a simple UDP6 echo server. On the Pi, run:
./udpcs SOURCE DEST
where SOURCE is the IPv6 address of the Pi, and DEST is the IPv6 address of the other machine. This will send test UDP6 datagrams to the server, consisting of a sequence number and random data, and wait for each datagram to be echoed back, retrying if there is no reply within 1 second. It also calculates and reports the UDP checksum for each packet.
When I run this, I get a steady stream of datagrams being sent and received, but at some point it will stop and keep retrying, on a packet whose reported checksum is 0xFFFF. Looking at the output on the other machine, you can see the packet in question is never received.
Here's an example of what I get on the client:
And on the server:
(obviously, the exact point where it stops will vary, since the UDP checksum depends on the source and destination IPv6 addresses and the random data).
By default on the Pi, UDP checksums are calculated on the NIC before sending (using checksum offloading), which can be disabled by installing "ethtool" and running:
sudo ethtool --offload eth0 rx off tx off
Doing this makes the problem go away, which would indicate that the NIC's checksum calculation algorithm is at fault. Does anyone know where this lives, and how to go about getting it fixed?
udpcs.c.txt