Pinging sam_e70 leads to unresponsive ethernet device at some point #11255
Comments
I can't reproduce this problem with PR #11888 applied. Even with the data cache disabled, the existing code is missing a data synchronization barrier after writing the descriptor list. A few packets were lost, but I guess that's normal at this rate. The Ethernet interface of the board is still working fine after this.
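To illustrate the barrier point, here is a minimal sketch of where a data synchronization barrier would sit when handing a descriptor to the hardware. The descriptor layout, the `TX_DESC_USED` and `TX_LEN_MASK` values, `fake_tx_start_reg` and `queue_fragment` are all assumptions made for the example and are not taken from the driver. On a non-ARM host the barrier falls back to a compiler builtin so the sketch can still be built.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical two-word TX descriptor; not the driver's real definition. */
struct tx_desc {
    volatile uint32_t addr;
    volatile uint32_t ctrl;
};

#define TX_DESC_USED (1u << 31) /* assumed: set while software owns the descriptor */
#define TX_LEN_MASK  0x3FFFu    /* assumed: length field width */

#if defined(__ARM_ARCH)
#define data_sync_barrier() __asm__ volatile("dsb sy" ::: "memory")
#else
#define data_sync_barrier() __sync_synchronize() /* host stand-in */
#endif

/* Stand-in for the "start transmission" register of the MAC. */
static volatile uint32_t fake_tx_start_reg;

void queue_fragment(struct tx_desc *d, const void *buf, uint32_t len)
{
    assert(len <= TX_LEN_MASK);

    /* On a 64-bit host this truncates the pointer; fine for a sketch. */
    d->addr = (uint32_t)(uintptr_t)buf;
    /* Writing ctrl without TX_DESC_USED hands the descriptor to the hardware. */
    d->ctrl = len & TX_LEN_MASK;

    /* Make sure the descriptor writes are complete before the MAC is told
     * to start reading the descriptor list. */
    data_sync_barrier();

    fake_tx_start_reg = 1u;
}
```

Without the barrier, the write that starts the transmission could become visible before the descriptor contents, which matches the kind of problem described above.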
|
As discussed in PR#11888, I am convinced this bug is due to the timeout code in the TX part of the Ethernet driver. Copying the description from the other PR: I have found it happens when the Zephyr network stack has to send 2 packets in a row, for example an ack for the just-received data followed by the answer data. This seems to be a latent issue, but it happens more often now that the instruction cache and the data cache have been enabled. Basically the CPU is now faster than the Ethernet device (which is good in some sense). The problem seems to be the following one:
I am not fully sure how exactly steps 2 and 3 are interleaved; what is sure is that the call to tx_completed is skipped for the second packet, as the job has already been cancelled. Adding some delay between sending packets (for instance at the beginning of eth_tx) works around the issue. @mnkp suggested:
This is basically the strategy used in some of the other Ethernet drivers. I am not sure it applies here, as there is still the risk of filling descriptors while the previous frame from the queue hasn't been fully sent. I am not sure how the Ethernet device splits the data in the queue into multiple frames, or even if it is able to do that. If we need to ensure we have a single frame in a queue at a given moment, I guess we should use the same technique, but with an additional semaphore working on frames instead of fragments. |
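The idea of an additional semaphore working on frames could be sketched as follows. This is a host-side model only: `count_sem`, `tx_frame`, `tx_frame_done` and `TX_DESC_COUNT` are hypothetical names, and the non-blocking `sem_try_take` stands in for what would be a blocking semaphore take in the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Tiny non-blocking counting semaphore, for illustration only. */
struct count_sem {
    unsigned count;
    unsigned limit;
};

static bool sem_try_take(struct count_sem *s)
{
    if (s->count == 0u) {
        return false;
    }
    s->count--;
    return true;
}

static void sem_give(struct count_sem *s)
{
    if (s->count < s->limit) {
        s->count++;
    }
}

#define TX_DESC_COUNT 4u /* assumed descriptor list size */

static struct count_sem desc_sem  = { TX_DESC_COUNT, TX_DESC_COUNT }; /* counts fragments */
static struct count_sem frame_sem = { 1u, 1u };                       /* counts frames */
static unsigned frags_in_flight;

/* Queue one frame made of nfrags fragments; refuse if a frame is pending. */
bool tx_frame(unsigned nfrags)
{
    if (nfrags == 0u || nfrags > TX_DESC_COUNT) {
        return false;
    }
    if (!sem_try_take(&frame_sem)) {
        return false; /* previous frame not fully sent yet */
    }
    for (unsigned i = 0u; i < nfrags; i++) {
        bool ok = sem_try_take(&desc_sem);
        assert(ok); /* cannot fail: no other frame holds descriptors */
        (void)ok;
    }
    frags_in_flight = nfrags;
    return true;
}

/* Called from the (simulated) TX complete interrupt for the whole frame. */
void tx_frame_done(void)
{
    while (frags_in_flight > 0u) {
        sem_give(&desc_sem);
        frags_in_flight--;
    }
    sem_give(&frame_sem);
}
```

With the frame semaphore limited to one, descriptors are never filled while the previous frame is still queued, at the cost of serializing transmissions.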
I have found that this is done by tagging the last fragment of the frame with |
I have just tried this and simulated an issue by ignoring one Ethernet TX interrupt out of 100. It doesn't work, as it leads to net buffer exhaustion before semaphore exhaustion. I think we should keep firing a delayed work after the timeout (defaulting to 5s) and just check all descriptors. If one is not empty, there has been a problem and we should just call tx_error_handler. As long as there is network traffic, this delayed work should not fire, given that the timeout is updated regularly. I don't think that going through the descriptor list once, 5 seconds after the last transmitted packet, is really problematic. We should just ensure that no new packet is sent during that check, maybe using a semaphore to access the descriptor list instead of disabling interrupts and comparing |
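A rough sketch of the descriptor check that such a delayed work could run is shown below. It is a host-side model: the descriptor layout, the `TX_DESC_USED` bit, `tx_watchdog_check` and the body of `tx_error_handler` are assumptions made for the example (only the name `tx_error_handler` comes from the driver), and the locking around the descriptor list is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TX_DESC_USED (1u << 31) /* assumed: set once the fragment has been sent */

/* Hypothetical descriptor layout. */
struct tx_desc {
    uint32_t addr;
    uint32_t ctrl;
};

static unsigned error_handler_calls;

/* Stand-in for the driver's error path: reinitialize every descriptor. */
static void tx_error_handler(struct tx_desc *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        list[i].addr = 0u;
        list[i].ctrl = TX_DESC_USED;
    }
    error_handler_calls++;
}

/*
 * Meant to run from the delayed work, i.e. only when no packet has been
 * sent for the whole timeout. Any descriptor still owned by the hardware
 * at that point means a completion was missed.
 * Returns true if a stuck descriptor was found and the list was reset.
 */
bool tx_watchdog_check(struct tx_desc *list, size_t n)
{
    assert(list != NULL);

    for (size_t i = 0; i < n; i++) {
        if ((list[i].ctrl & TX_DESC_USED) == 0u) {
            tx_error_handler(list, n);
            return true;
        }
    }
    return false;
}
```

Since the check only runs after a quiet period, walking the whole list is cheap compared to the timeout itself.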
Even if an occasional interrupt was lost, the semaphore count should be kept in sync with the net buffer count. If it's not, that's a driver bug. But maybe you meant the net pkt count, i.e. the difference between the data structures defined by NET_BUF_TX_COUNT and NET_PKT_TX_COUNT? Sorry for the late reply. I'm no longer working with Atmel chips. That became a hobby project. |
The TX semaphore count is not kept in sync with the TX net buffer count, but it's not a driver issue. The assumption that only TX net buffers are used in the TX path is wrong. For instance in the ICMP code, the RX net buffers are reused in the TX path. What I observe when pinging the board is an exhaustion of the RX net buffers. In addition to that I guess that sending small packets using a single net buffer with |
I wonder if we should, at least as a first step, remove this timeout error handling. It is there to prevent a lock-up if there is a bug in the driver or in case of hardware malfunction. Right now we know that the timeout bug happens with high traffic and actually causes the lock-up it is supposed to prevent. Also note that in case of a malfunction detected by the hardware (i.e. through a TX error interrupt), the whole descriptor list and the net buffers are correctly reinitialized. |
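The mismatch between the TX semaphore and the buffers actually consumed can be shown with a small model. The pool sizes and the function names below are made up for the example; the only point taken from the discussion is that an echo reply travels through the TX path while holding a buffer from the RX pool.

```c
#include <assert.h>
#include <stdbool.h>

#define RX_BUF_COUNT 4u /* assumed RX pool size */
#define TX_BUF_COUNT 8u /* assumed TX semaphore initial count */

static unsigned rx_free = RX_BUF_COUNT; /* free RX net buffers */
static unsigned tx_sem  = TX_BUF_COUNT; /* TX semaphore count */

/* Queue an echo reply that reuses the RX buffer of the request. */
bool queue_echo_reply(void)
{
    if (rx_free == 0u || tx_sem == 0u) {
        return false;
    }
    rx_free--; /* the buffer comes from the RX pool... */
    tx_sem--;  /* ...but is accounted against the TX semaphore */
    return true;
}

/* TX completion: buffer goes back to the RX pool, semaphore is given. */
void echo_reply_sent(void)
{
    assert(rx_free < RX_BUF_COUNT && tx_sem < TX_BUF_COUNT);
    rx_free++;
    tx_sem++;
}
```

If completions are missed, this model runs out of RX buffers after four replies while the TX semaphore still reports four free slots, so the semaphore never gets a chance to signal the problem.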
Thanks for pointing it out. That adds one more dimension to take into account. I'll have to think a bit more about the issue. Also, in case of sending small frames we could likely have net packet exhaustion before we run out of net buffers.
Yes, that's a good idea. We need more time to properly implement the timeout; there is no need to keep the driver broken in the meantime. |
@jukkar I have seen you got this issue assigned. I have come to understand that more than one issue ends up blocking the TX path, and especially one introduced by me while working on the patches from PR#11888. I finally have a set of patches almost ready. I just need to write proper commit messages, which are quite important for this kind of change. I'll submit them in the next few days. |
@aurel32 thanks for the info, I am looking forward to trying any fixes for this. |
On target a net ping shows it can send/recv packets.
From an external host: ping -i 0.00001 will work for some time until it cannot reach the target. On target side, a net ping does not work anymore.
Seems like either the ethernet driver or the ethernet device on the sam_e70 gets unresponsive.