eth: pinging frdm k64f eventually leads to unresponsive ethernet device #16639

Closed
aunsbjerg opened this issue Jun 5, 2019 · 15 comments · Fixed by #17603
Labels: area: Networking · bug · platform: NXP · priority: high
Milestone: v2.0.0

@aunsbjerg (Collaborator)

Describe the bug
Pinging the frdm_k64f board works for a while, but eventually the board stops responding to any Ethernet requests. The board also becomes unable to send ping requests to a host. This unresponsiveness persists until the board is reset.

I am reasonably sure that this issue is not specific to ICMP, as I have observed the same unresponsive behaviour with UDP traffic; ICMP is just the most reliable way of provoking the issue.

This issue could be an mcux variant of #11255.

The issue is seen on latest master and on the v1.14.0 tag.

To Reproduce

  1. cd samples/net/sockets/echo_client
  2. west build -b frdm_k64f -d build
  3. west flash --runner jlink
  4. sudo ping 192.0.2.2 -i 0.001
  5. ping responses will eventually stop

Expected behavior
I expect the frdm_k64f not to become unresponsive.

Impact
The project I'm working on is based on the frdm_k64f and will rely heavily on a stable ethernet connection. This issue is therefore a showstopper for me, especially since it seems to require a device reset to restore functionality.

Screenshots or console output
Wireshark log of when communication stops.

[Screenshot from 2019-06-05 16-55-39]

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Toolchain: Zephyr SDK 0.10.0
  • Commit: 82497ec
aunsbjerg added the bug label on Jun 5, 2019
@aunsbjerg (Collaborator, Author)

Interestingly, I have not been able to recreate the issue with the echo_client sample in 1.13. I will take a look at the code differences in the morning.

Also, I suspect #3129 might be related to this issue; the symptoms sound awfully familiar.

jukkar (Member) commented Jun 6, 2019

I tried frdm-k64f last week and basically the gPTP code was not working at all (time sync packets were not received properly). I don't know what was wrong with it; it could be related to #16089. Has the HAL changed recently in a way that could explain this bit rot?

agansari (Collaborator) commented Jun 6, 2019

@aunsbjerg I was able to reproduce this issue on my setup with the latest code.
@jukkar the HAL changed a bit for Kinetis devices when the KE1xF SoC was added, but the ENET device hasn't changed since I worked on it a few months ago. I remember reproducing #16089 weeks ago. Can you try gPTP again and either create an issue or post your findings here?

jukkar (Member) commented Jun 7, 2019

@agansari I am seeing these error messages with the latest master (commit 69de620):

<err> eth_mcux: ENET_SendFrame error: 4004

I am using the samples/net/gptp application. The same application works fine on the Atmel SAM E70 board.

@tbursztyka (Collaborator)

This driver is really broken; see also #16089.

pfalcon (Contributor) commented Jun 8, 2019

@aunsbjerg: Can you please provide more information:

  1. At which ping number does the issue happen? Based on the Wireshark screenshot, that would be around ping no. 184, but it would be nice to have the explicit number. Is that number consistent between different runs?

  2. sudo ping 192.0.2.2 -i 0.001

That's effectively a flood ping. a) Did you ever run such a command against any other type of Zephyr device? (What were the results?) b) What happens if you run just a normal ping 192.0.2.2? Does the failure happen anywhere near the same ping number as discussed in p.1?

@aunsbjerg (Collaborator, Author)

@pfalcon

The number is not consistent between different runs. Some general, anecdotal observations:

  • The higher the ping frequency, the faster the error occurs.
  • The error also happens at a lower traffic rate (1 ping / sec), but it seems to happen less frequently.
  • The error apparently also happens with UDP traffic. This is actually where I first saw the issue, while developing a UDP protocol for a project.
  • Using the UART shell while sending ping packets seems to accelerate the issue. The device will often stop replying at the exact moment I'm typing something into the shell.

I have only tried running the ping flood against a QEMU target running the echo_client sample; I was not able to reproduce the issue over the course of one hour. I do not have any other development boards, so I cannot reproduce on other Zephyr devices.

Hope that helps. Let me know if there is anything else I can do.

pfalcon (Contributor) commented Jun 8, 2019

@aunsbjerg: Thanks for the detailed info, should be helpful when reproducing and investigating the issue.

I have only tried running the ping flood against a qemu target running the echo_client sample. I was not able to reproduce the issue over the course of one hour.

Do you use the procedure described in https://docs.zephyrproject.org/latest/guides/networking/qemu_setup.html, i.e. the classic SLIP networking in QEMU? I find the above report strange, as QEMU has problems with UART emulation, which affects SLIP, which in turn affects networking stability. It was found that such a setup is not suitable for any load testing. That's why a newer, non-default setup with Ethernet emulation was introduced (https://docs.zephyrproject.org/latest/guides/networking/qemu_eth_setup.html#networking-with-eth-qemu).

Hope that helps. Let me know if there is anything else I can do.

Thanks for the report and the information. I guess the next step would be bisecting the tree to find the point where it broke, as we definitely had it working better than failing on the ~185th ping (though I never used flood pings). I'm glad @agansari has got this ticket; I hope to be able to help with any confirmations or testing of the results needed (but otherwise I'm concentrating on other tasks now).

ioannisg added this to the v2.0.0 milestone on Jun 11, 2019
@MaureenHelm (Member)

@agansari can you take a look at this?

agansari (Collaborator) commented Jul 8, 2019

@MaureenHelm I've been debugging this issue; so far pull #17396 improves the behavior of the driver, but does not completely fix the issue.
@aunsbjerg can you test this pull on your project?
As @pfalcon mentioned, this is a flood test, so it will fail at some point. The problem I have is that I can't actually catch the fail point (it occurs after minutes/hours). Finding the actual fail point would help, by either dropping frames or resetting the device so it keeps working properly.

@aunsbjerg (Collaborator, Author)

@agansari I'm currently on holiday but will be back in the middle of next week; I'll test your fix then.

agansari (Collaborator) commented Jul 12, 2019

Short bug description:
The ping flooding ends up blocking the DMA bus inside the ENET device (the EBERR bit gets set).

Long bug description:
I found this in my new pull request #17396, tested by @jukkar. It crashed because the buffer descriptor had pending requests; looking further, I found that the ENET device had been disabled by hardware (the ENET_ECR[ETHEREN] bit was cleared to 0). The user manual describes this scenario as being caused by a bus error on the uDMA.
By enabling kENET_EBusERInterrupt I can see in eth_mcux_error_isr() that a bus error fires at the moment ping stops / TCP transfers stall.
I tested this on latest master and on my pull. The pull pushes packets faster as it has fewer interlocks, so it crashes sooner.
It can be reproduced both with @aunsbjerg's issue description and with the pull's test.

Solutions:

  1. Mandatory - when ENET_ECR[ETHEREN]=0 the ENET device has crashed and needs to go through a recovery sequence to be turned back on. Currently, once the device crashes, it cannot be used anymore. (A minimal sketch of this idea follows the list.)
  2. Optional - find the scenario that pushes the DMA to its limit and optimize the data flow.
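For illustration only, here is a minimal sketch of what the mandatory recovery path could look like. The helper eth_mcux_check_and_recover() and the re-init hook eth_mcux_reinit() are hypothetical names, not part of the current driver; ENET_ECR_ETHEREN_MASK, ENET_ActiveRead() and the Zephyr logging macro are existing APIs.

/* Hypothetical recovery sketch for solution 1 (not the actual fix):
 * detect that the MAC was disabled by a uDMA bus error and bring it back.
 * Assumes the usual eth_mcux.c context (fsl_enet.h, device headers, logging). */
static void eth_mcux_check_and_recover(struct eth_context *context)
{
        /* On a bus error (EBERR) the hardware clears ENET_ECR[ETHEREN],
         * so the MAC silently stops receiving and transmitting. */
        if ((ENET->ECR & ENET_ECR_ETHEREN_MASK) == 0U) {
                LOG_ERR("ENET disabled by bus error, attempting recovery");

                /* Re-run the driver's MAC / buffer descriptor setup
                 * (illustrative helper name), then restart reception. */
                eth_mcux_reinit(context);       /* hypothetical re-init hook */
                ENET_ActiveRead(ENET);          /* resume the RX descriptor ring */
        }
}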

Debug code

diff --git a/drivers/ethernet/eth_mcux.c b/drivers/ethernet/eth_mcux.c
index 2f0223b753..4b3c3423b9 100644
--- a/drivers/ethernet/eth_mcux.c
+++ b/drivers/ethernet/eth_mcux.c
@@ -816,6 +816,7 @@ static int eth_0_init(struct device *dev)
        enet_config.interrupt |= kENET_RxFrameInterrupt;
        enet_config.interrupt |= kENET_TxFrameInterrupt;
        enet_config.interrupt |= kENET_MiiInterrupt;
+       enet_config.interrupt |= kENET_EBusERInterrupt;
 
 #ifdef CONFIG_ETH_MCUX_PROMISCUOUS_MODE
        enet_config.macSpecialConfig |= kENET_ControlPromiscuousEnable;
@@ -1004,6 +1005,10 @@ static void eth_mcux_error_isr(void *p)
        struct eth_context *context = dev->driver_data;
        u32_t pending = ENET_GetInterruptStatus(ENET);
 
+       if (pending & kENET_EBusERInterrupt) {
+               printk("\n\nISR - kENET_EBusERInterrupt\n\n");
+       }
+
        if (pending & ENET_EIR_MII_MASK) {
                k_work_submit(&context->phy_work);
                ENET_ClearInterruptStatus(ENET, kENET_MiiInterrupt);
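
One note on the debug hunk above: as posted it only logs the event. If the handler is kept around, it would typically also clear the pending bit, using the same ENET_ClearInterruptStatus() call the ISR already uses for MII, so the error interrupt does not keep retriggering. A hedged variant of the added block:

        if (pending & kENET_EBusERInterrupt) {
                printk("\n\nISR - kENET_EBusERInterrupt\n\n");
                /* acknowledge the bus error so the ISR is not re-entered */
                ENET_ClearInterruptStatus(ENET, kENET_EBusERInterrupt);
        }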

TODO:

  • test on i.MX RT, as the ENET device is present there as well
  • write recovery code

agansari (Collaborator) commented Jul 12, 2019

Debugging further, I found that it's an MPU-related issue; the ENET device tries to access a user-space address in RAM via its uDMA. Will continue debugging on Monday.

Disabling the MPU in the eth device's initialization bypasses the issue.
I.e. in eth_0_init() add SYSMPU->CESR &= ~SYSMPU_CESR_VLD_MASK;
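
A rough sketch of where that debug bypass sits (this is the workaround described above, not the fix that eventually landed in #17603; everything around the one added line is abbreviated). SYSMPU and SYSMPU_CESR_VLD_MASK come from the Kinetis device headers:

static int eth_0_init(struct device *dev)
{
        /* ... existing clock, PHY and buffer descriptor setup ... */

        /* Debug-only bypass: clear CESR[VLD] so the system MPU performs no
         * access checks and the ENET uDMA can reach the RX/TX buffers in RAM.
         * A cleaner alternative is configuring a region descriptor that grants
         * the ENET bus master read/write access to those buffers. */
        SYSMPU->CESR &= ~SYSMPU_CESR_VLD_MASK;

        /* ... rest of the existing initialization ... */
        return 0;
}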

@agansari (Collaborator)

@aunsbjerg the issue is related to the MPU blocking ENET's DMA access to RAM. See pull #17603.

@aunsbjerg (Collaborator, Author)

@agansari I just did a test run with your PR, and it seems to be working perfectly - good job!
