eth: pinging frdm k64f eventually leads to unresponsive ethernet device #16639

Closed
aunsbjerg opened this issue Jun 5, 2019 · 15 comments · Fixed by #17603
Labels: area: Networking · bug · platform: NXP · priority: high
Milestone: v2.0.0

@aunsbjerg (Collaborator)

Describe the bug
Pinging the frdm_k64f board works for a while, but eventually the board stops responding to any Ethernet requests. The board also becomes unable to send ping requests to a host. This unresponsiveness persists until the board is reset.

I am reasonably sure that this issue is not specific to ICMP, as I have observed the same unresponsive behaviour with UDP traffic; ICMP is just the most reliable way of provoking the issue.

This issue could be an mcux variant of #11255.

The issue is seen on latest master and on the v1.14.0 tag.

To Reproduce

  1. cd samples/net/sockets/echo_client
  2. west build -b frdm_k64f -d build
  3. west flash --runner jlink
  4. sudo ping 192.0.2.2 -i 0.001
  5. ping responses will eventually stop

Expected behavior
I expect the frdm_k64f not to become unresponsive.

Impact
The project I'm working on is based on the frdm_k64f and will rely heavily on a stable ethernet connection. This issue is therefore a showstopper for me, especially since it seems to require a device reset to restore functionality.

Screenshots or console output
Wireshark log of when communication stops.

[Screenshot from 2019-06-05 16-55-39]

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Toolchain: Zephyr SDK 0.10.0
  • Commit: 82497ec
aunsbjerg added the bug label on Jun 5, 2019
@aunsbjerg (Collaborator, Author)

Interestingly, I have not been able to recreate the issue with the echo_client sample in 1.13. I will take a look at the code differences in the morning.

Also, I suspect #3129 might be related to this issue; the symptoms sound awfully familiar.

jukkar (Member) commented Jun 6, 2019

I tried frdm-k64f last week and basically the gPTP code was not working at all (time sync packets were not received properly). I don't know what was wrong with it; it could be related to #16089. Has the HAL changed recently in a way that could explain this bit rot?

agansari (Collaborator) commented Jun 6, 2019

@aunsbjerg I was able to reproduce this issue on my setup with the latest code.
@jukkar the HAL changed a bit for Kinetis devices when the KE1xF SoC was added, but the ENET device hasn't changed since I worked on it a few months ago. I remember reproducing #16089 weeks ago. Can you try gPTP again and either create an issue or post your findings here?

jukkar (Member) commented Jun 7, 2019

@agansari I am seeing these error messages with the latest master (commit 69de620):

<err> eth_mcux: ENET_SendFrame error: 4004

I am using the samples/net/gptp application. The same application works fine on the Atmel SAM E70 board.

@tbursztyka (Collaborator)

This driver is really broken; see also #16089.

pfalcon (Contributor) commented Jun 8, 2019

@aunsbjerg: Can you please provide more information:

  1. At which ping number does the issue happen? Based on the Wireshark screenshot, that would be around ping no. 184, but it would be nice to have the explicit number. Is that number consistent between different runs?

  2. sudo ping 192.0.2.2 -i 0.001

That's effectively a flood ping. a) Did you ever run such a command against any other type of Zephyr device? (What were the results?) b) What happens if you run just a normal ping 192.0.2.2? Does the failure happen anywhere near the same ping number as discussed in p.1?

@aunsbjerg (Collaborator, Author)

@pfalcon

The number is not consistent between different runs. Some general, anecdotal observations:

  • The higher the ping frequency, the faster the error occurs.
  • The error also happens at a lower traffic rate (1 ping / sec), but it seems to happen less frequently.
  • The error apparently also happens with UDP traffic. This is actually where I first saw the issue, while developing a UDP protocol for a project.
  • Using the UART shell while sending ping packets seems to accelerate the issue. The device will often stop replying at the exact moment I'm typing something into the shell.

I have only tried running the ping flood against a QEMU target running the echo_client sample; I was not able to reproduce the issue over the course of one hour. I do not have any other development boards, so I cannot reproduce on other Zephyr devices.

Hope that helps. Let me know if there is anything else I can do.

pfalcon (Contributor) commented Jun 8, 2019

@aunsbjerg: Thanks for the detailed info, should be helpful when reproducing and investigating the issue.

I have only tried running the ping flood against a qemu target running the echo_client sample. I was not able to reproduce the issue over the course of one hour.

Do you use the procedure described in https://docs.zephyrproject.org/latest/guides/networking/qemu_setup.html, i.e. the classic SLIP networking in QEMU? I find the above report strange, as QEMU has problems with UART emulation, which affects SLIP, which in turn affects networking stability. It was found that such a setup is not suitable for any load testing. That's why a newer, non-default setup with Ethernet emulation was introduced (https://docs.zephyrproject.org/latest/guides/networking/qemu_eth_setup.html#networking-with-eth-qemu).

Hope that helps. Let me know if there is anything else I can do.

Thanks for the report and the information. I guess the next step would be bisecting the tree to find the point where it broke, as we definitely had it working better than failing on the ~185th ping (though I never used flood pings). I'm glad @agansari has got this ticket; I hope to be able to help with any confirmations or testing of the results needed (but otherwise I'm concentrating on other tasks now).

ioannisg added this to the v2.0.0 milestone on Jun 11, 2019
@MaureenHelm (Member)

@agansari can you take a look at this?

agansari (Collaborator) commented Jul 8, 2019

@MaureenHelm I've been debugging this issue; so far pull #17396 improves the behavior of the driver, but does not completely fix the issue.
@aunsbjerg can you test this pull on your project?
As @pfalcon mentioned, this is a flood test, so it will fail at some point. The problem I have is that I can't actually catch the fail point (it occurs after minutes/hours). Finding the actual fail point would help, by either dropping frames or resetting the device so it keeps working properly.

@aunsbjerg (Collaborator, Author)

@agansari I'm currently on holiday but will be back in the middle of next week; I'll test your fix then.

agansari (Collaborator) commented Jul 12, 2019

Short bug description:
The ping flooding ends up blocking the DMA bus inside the ENET device (the EBERR bit gets set).

Long bug description:
I found this in my new pull request #17396, tested by @jukkar. It crashed because the buffer descriptor had pending requests; looking further, I found that the ENET device had been disabled by hardware (the ENET_ECR[ETHEREN] bit was cleared to 0). The user manual describes this scenario as being caused by a bus error on the uDMA.
By enabling kENET_EBusERInterrupt I can see in eth_mcux_error_isr() that a bus error fires at the moment ping stops / TCP transfers stall.
I tested this on latest master and on my pull. The pull pushes packets faster as it has fewer interlocks, so it crashes sooner.
It can be reproduced both with @aunsbjerg's issue description and with the pull's test.

Solutions:

  1. Mandatory - when ENET_ECR[ETHEREN]=0 the ENET device has crashed and needs to go through a recovery sequence to be turned back on. Currently, once the device crashes, it cannot be used anymore. (A minimal sketch of this idea follows the list.)
  2. Optional - find the scenario that pushes the DMA to its limit and optimize the data flow.
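For illustration only, here is a minimal sketch of what the mandatory recovery path could look like. The helper eth_mcux_check_and_recover() and the re-init hook eth_mcux_reinit() are hypothetical names, not part of the current driver; ENET_ECR_ETHEREN_MASK, ENET_ActiveRead() and the Zephyr logging macro are existing APIs.

/* Hypothetical recovery sketch for solution 1 (not the actual fix):
 * detect that the MAC was disabled by a uDMA bus error and bring it back.
 * Assumes the usual eth_mcux.c context (fsl_enet.h, device headers, logging). */
static void eth_mcux_check_and_recover(struct eth_context *context)
{
        /* On a bus error (EBERR) the hardware clears ENET_ECR[ETHEREN],
         * so the MAC silently stops receiving and transmitting. */
        if ((ENET->ECR & ENET_ECR_ETHEREN_MASK) == 0U) {
                LOG_ERR("ENET disabled by bus error, attempting recovery");

                /* Re-run the driver's MAC / buffer descriptor setup
                 * (illustrative helper name), then restart reception. */
                eth_mcux_reinit(context);       /* hypothetical re-init hook */
                ENET_ActiveRead(ENET);          /* resume the RX descriptor ring */
        }
}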

Debug code

diff --git a/drivers/ethernet/eth_mcux.c b/drivers/ethernet/eth_mcux.c
index 2f0223b753..4b3c3423b9 100644
--- a/drivers/ethernet/eth_mcux.c
+++ b/drivers/ethernet/eth_mcux.c
@@ -816,6 +816,7 @@ static int eth_0_init(struct device *dev)
        enet_config.interrupt |= kENET_RxFrameInterrupt;
        enet_config.interrupt |= kENET_TxFrameInterrupt;
        enet_config.interrupt |= kENET_MiiInterrupt;
+       enet_config.interrupt |= kENET_EBusERInterrupt;
 
 #ifdef CONFIG_ETH_MCUX_PROMISCUOUS_MODE
        enet_config.macSpecialConfig |= kENET_ControlPromiscuousEnable;
@@ -1004,6 +1005,10 @@ static void eth_mcux_error_isr(void *p)
        struct eth_context *context = dev->driver_data;
        u32_t pending = ENET_GetInterruptStatus(ENET);
 
+       if (pending & kENET_EBusERInterrupt) {
+               printk("\n\nISR - kENET_EBusERInterrupt\n\n");
+       }
+
        if (pending & ENET_EIR_MII_MASK) {
                k_work_submit(&context->phy_work);
                ENET_ClearInterruptStatus(ENET, kENET_MiiInterrupt);
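
One note on the debug hunk above: as posted it only logs the event. If the handler is kept around, it would typically also clear the pending bit, using the same ENET_ClearInterruptStatus() call the ISR already uses for MII, so the error interrupt does not keep retriggering. A hedged variant of the added block:

        if (pending & kENET_EBusERInterrupt) {
                printk("\n\nISR - kENET_EBusERInterrupt\n\n");
                /* acknowledge the bus error so the ISR is not re-entered */
                ENET_ClearInterruptStatus(ENET, kENET_EBusERInterrupt);
        }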

TODO:

  • test on i.MX RT, as the ENET device is present there as well
  • write recovery code

agansari (Collaborator) commented Jul 12, 2019

Debugging further, I found that it's an MPU-related issue; the ENET device tries to access a user-space address in RAM via its uDMA. Will continue debugging on Monday.

Disabling the MPU in the eth device's initialization bypasses the issue.
I.e. in eth_0_init() add SYSMPU->CESR &= ~SYSMPU_CESR_VLD_MASK;
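
A rough sketch of where that debug bypass sits (this is the workaround described above, not the fix that eventually landed in #17603; everything around the one added line is abbreviated). SYSMPU and SYSMPU_CESR_VLD_MASK come from the Kinetis device headers:

static int eth_0_init(struct device *dev)
{
        /* ... existing clock, PHY and buffer descriptor setup ... */

        /* Debug-only bypass: clear CESR[VLD] so the system MPU performs no
         * access checks and the ENET uDMA can reach the RX/TX buffers in RAM.
         * A cleaner alternative is configuring a region descriptor that grants
         * the ENET bus master read/write access to those buffers. */
        SYSMPU->CESR &= ~SYSMPU_CESR_VLD_MASK;

        /* ... rest of the existing initialization ... */
        return 0;
}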

@agansari (Collaborator)

@aunsbjerg the issue is related to the MPU blocking ENET's DMA access to RAM. See pull #17603.

@aunsbjerg (Collaborator, Author)

@agansari I just did a test run with your PR, and it seems to be working perfectly - good job!
