ENC28J60: dns resolve fails after few minutes uptime #54199

Closed
joelguittet opened this issue Jan 28, 2023 · 21 comments
Labels
area: Ethernet, area: Networking, bug (The issue is a bug, or the PR is fixing a bug), priority: low (Low impact/importance bug), Stale

Comments

@joelguittet
Contributor

joelguittet commented Jan 28, 2023

Describe the bug

I'm running an HTTP client application on an STM32L4A6 Nucleo board with an ENC28J60 module.
After a few minutes, DNS resolution stops working. I have observed this in my application and also in the console, by running net dns query google.fr repeatedly until the first failure, which occurs only a few minutes after starting the application.

In my application, the getaddrinfo function returns error code -101.
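
For reference, here is a minimal sketch of how such numeric codes could be turned into readable names when logging. The values are an assumption: they follow the glibc-style EAI_* numbering that Zephyr's DNS_EAI_* constants appear to mirror, so they should be checked against zephyr/net/dns_resolve.h in the tree in use. The function name dns_eai_name is hypothetical.

```c
#include <stddef.h>

/*
 * Map a getaddrinfo()-style error code to a readable name.
 * ASSUMPTION: glibc-style EAI_* numbering; verify the values against
 * zephyr/net/dns_resolve.h before relying on this table.
 */
const char *dns_eai_name(int code)
{
    switch (code) {
    case -2:   return "EAI_NONAME";     /* name does not resolve */
    case -3:   return "EAI_AGAIN";      /* temporary failure */
    case -4:   return "EAI_FAIL";       /* non-recoverable failure */
    case -10:  return "EAI_MEMORY";     /* out of memory */
    case -11:  return "EAI_SYSTEM";     /* system error, see errno */
    case -100: return "EAI_INPROGRESS"; /* request still in progress */
    case -101: return "EAI_CANCELED";   /* request was canceled */
    default:   return "unknown";
    }
}
```

Under that numbering, -101 would be a canceled request, which would be consistent with the "Cancelling DNS req" line that follows the query timeout in the debug log below.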

My prj.conf contains in particular:

CONFIG_DNS_RESOLVER=y
CONFIG_DNS_RESOLVER_LOG_LEVEL_DBG=y

To Reproduce
Steps to reproduce the behavior:

  1. On nucleo_l4a6 + enc28j60, build a project with Ethernet support
  2. In the console, type net dns query google.fr

Expected behavior
DNS resolution working.

Impact
Blocking the development.

Logs and console output

In the console, after enabling debug logs, I see that the failure is caused by a timeout:

[00:01:08.776,000] <dbg> net_dns_resolve: dns_resolve_name: (): DNS id will be 34471
[00:01:08.776,000] <dbg> net_dns_resolve: dns_write: (): [0] submitting work to server idx 0 for id 34471 hash 24745
[00:01:10.776,000] <dbg> net_dns_resolve: query_timeout: (sysworkq): Query timeout DNS req 34471 type 1 hash 24745
[00:01:10.776,000] <dbg> net_dns_resolve: dns_resolve_cancel_with_hash: (sysworkq): Cancelling DNS req 34471 (name my-server.com type 1 hash 24745)

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK
  • I'm currently at Zephyr commit ID e9a5951 (v3.3.0-rc1 tag).

Additional context
DHCPv4 enabled, no IPv6.
The IP address is provided by the local router. The DNS server is provided at the same time; it is the local router (192.168.1.1), and the net dns command properly shows the DNS address before and after the issue occurs.
Before the issue occurs, pings from the console work. After the issue occurs, pings from the console fail with a timeout.
The same application has been run on the stm32f746g_disco board (RMII Ethernet PHY) with no issues => the problem comes from ENC28J60 support.

Thanks for any support,
Joel

@joelguittet joelguittet added the bug The issue is a bug, or the PR is fixing a bug label Jan 28, 2023
@RomainPelletant
Contributor

Hi @joelguittet,

Can you describe your network configuration, please? (DHCP / static IP address / DNS entries...)

@joelguittet
Contributor Author

joelguittet commented Jan 30, 2023

Hello @RomainPelletant

Sure. I have DHCPv4 enabled; no static IP address. The DNS server is the local router (properly shown in the shell) at 192.168.1.1.
Information added in the post above.

I just ran the dns_resolve sample on another board (with no enc28j60, so the config is quite different). It works properly.
In the dns_resolve sample I noticed there is a specific configuration, as follows:

# Enable the DNS resolver
CONFIG_DNS_RESOLVER=y
# Enable additional buffers
CONFIG_DNS_RESOLVER_ADDITIONAL_BUF_CTR=5
# Enable additional queries
CONFIG_DNS_RESOLVER_ADDITIONAL_QUERIES=2
# Enable mDNS support
CONFIG_MDNS_RESOLVER=y
# Enable LLMNR support
CONFIG_LLMNR_RESOLVER=n

CONFIG_DNS_RESOLVER_MAX_SERVERS=2
CONFIG_DNS_SERVER_IP_ADDRESSES=y
CONFIG_DNS_NUM_CONCUR_QUERIES=5

But the documentation at https://docs.zephyrproject.org/latest/connectivity/networking/api/dns_resolve.html does not really explain the details of these CONFIG options. It just says to check the Kconfig, which is not enough I think.

I should maybe play with these settings, right?

Regards,

@RomainPelletant
Contributor

If I am not wrong, DHCP should add the right DNS entry (in your case 192.168.1.1).
To find out where the problem is, you could add fixed DNS entries like the following:

CONFIG_DNS_SERVER_IP_ADDRESSES=y
CONFIG_DNS_SERVER1="192.168.1.1"
CONFIG_DNS_SERVER2="8.8.8.8"

Does using the IP address directly work?
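
One way to test this from the application is to skip the resolver whenever the host string is already an IPv4 literal. If connections by literal address also start failing after a few minutes, the problem is below DNS. This is only a sketch using the POSIX inet_pton call on a host system; the helper name parse_ipv4_literal is hypothetical, and on Zephyr the equivalent call depends on how the socket API is configured.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Return true if `host` is a dotted-quad IPv4 literal and store the
 * parsed address in `out`. Return false for anything else (for example
 * a host name), in which case the caller would fall back to DNS.
 */
bool parse_ipv4_literal(const char *host, struct in_addr *out)
{
    /* inet_pton() returns 1 only when the string is a valid address. */
    return inet_pton(AF_INET, host, out) == 1;
}
```

With this, "8.8.8.8" is used as-is, while "google.fr" still goes through the resolver.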

@joelguittet
Contributor Author

This seems to help, but only for a short while; after that I am no longer able to resolve names. Using the shell, a timeout or a -11 error is reported. I will try to use some configuration settings from the dns_resolve sample.

@stephanosio stephanosio added the platform: STM32 ST Micro STM32 label Jan 31, 2023
@stephanosio stephanosio added the priority: low Low impact/importance bug label Jan 31, 2023
@joelguittet
Contributor Author

Using the DNS configuration from the dns_resolve sample I got the same result: DNS resolution works at the beginning but fails after a few minutes.

Maybe it's due to the enc28j60 driver? I don't know how to determine this :-(

@Desvauxm-st
Contributor

Desvauxm-st commented Feb 2, 2023

Hi @joelguittet
The issue you face is likely due to the enc28j60 driver and not to the STM32.
Maybe we should check with the Ethernet maintainer, @tbursztyka, or the networking maintainer, @rlubos?

@erwango erwango added area: Ethernet and removed platform: STM32 ST Micro STM32 labels Feb 2, 2023
@erwango erwango assigned tbursztyka and rlubos and unassigned erwango Feb 2, 2023
@rlubos
Contributor

rlubos commented Feb 3, 2023

I'd check if the DNS server address does not get corrupted somehow (you can list configured servers with net dns).

Otherwise, if it's a driver issue, I'm not really of much use as I don't have experience with that particular one. Are you able to send any other packets when DNS starts to malfunction (for example with net tcp command)?

@joelguittet
Contributor Author

Hello

net dns still shows the DNS address after the issue occurs.
I have checked whether anything else works by using net ping 8.8.8.8 in the console. Before the issue occurs it works. After the issue occurs I get a "Ping timeout" error.

Note: When I opened the ticket the issue occurred 1 minute after uptime. But it seems to be generally 3 or 4 minutes after uptime.

It seems the issue is related to the ENC28J60 driver, but it's strange that I'm the first to report it, isn't it? Or maybe the driver is not used a lot? I should receive an ESP01 module soon; I will check with this module whether I get the error with my hardware and my application, which will confirm (or not) whether this is due to the enc28j60 driver.

Joel

@joelguittet
Contributor Author

Hello

I have built and run my application on the stm32f746g_disco board, which has an RMII LAN8742A Ethernet PHY, without any issue. The only difference is the Ethernet PHY. The configuration used was the same (DHCPv4, dynamic DNS address...).

We can conclude this is an issue with the ENC28J60 driver. I'm updating the first post of this thread to indicate this.

Joel

@carlescufi
Member

We can conclude this is an issue with the ENC28J60 driver. I'm updating the first post of this thread to indicate this.

Unfortunately this driver is currently unmaintained, so we need a volunteer to step up and provide a fix. Unassigning @rlubos since he is the maintainer of the networking stack and not individual drivers.

@joelguittet
Contributor Author

@carlescufi no problem, I understand. I will try the W5500 and the ESP01 as replacements for my project. I'm not experienced enough with Zephyr to start this kind of deep debugging of the networking parts :-)

@RomainPelletant
Contributor

@joelguittet FYI I am using eth_enc424j600 without any DNS issues.

@joelguittet joelguittet changed the title from "dns resolve: fails after 60 seconds uptime" to "ENC28J60: dns resolve fails after few minutes uptime" Feb 8, 2023
@joelguittet
Contributor Author

Thanks @RomainPelletant. I received a W5500 module this afternoon. I wired it and reconfigured the dts, and it works, again with no modification of the app. This is a second confirmation that the enc28j60 has issues.

@jfischer-no
Collaborator

I briefly tested it on nRF52840DK today. I used the zperf sample (net ping 2001:db8::affe -c 1000 -i 10 -s 1000) and see MPU FAULT <err> eth_enc28j60: Failed to read memory somewhere in buf_simple. This reminds me of a weird bug, which seemed to be a controller RX FIFO overflow, that I fought in this driver a few years ago; I finally gave up and implemented the enc424j600 driver. If anyone wants to spend time on this, please reassign.
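
If someone does pick this up, one way to check the RX overflow suspicion would be to look at the receive-error flag in the controller's interrupt register whenever the link goes quiet. A minimal sketch, assuming the EIR bit layout from the ENC28J60 datasheet (RXERIF in bit 0, PKTIF in bit 6). The macro and helper names are hypothetical and are not taken from the Zephyr driver, and reading the register over SPI is left to the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: EIR bit positions as given in the ENC28J60 datasheet. */
#define ENC_EIR_RXERIF 0x01u /* receive error: buffer full or count saturated */
#define ENC_EIR_PKTIF  0x40u /* at least one packet is waiting in the buffer */

/* True if the controller reports that it had to drop a received packet. */
bool enc_rx_error_pending(uint8_t eir)
{
    return (eir & ENC_EIR_RXERIF) != 0;
}

/* True if the controller reports unread packets in its RX buffer. */
bool enc_rx_packet_pending(uint8_t eir)
{
    return (eir & ENC_EIR_PKTIF) != 0;
}
```

Seeing the receive-error flag set at the moment pings start timing out would point toward the RX buffer filling up rather than toward the DNS resolver.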

@github-actions

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Jun 13, 2023
@jgl-meta jgl-meta removed the Stale label Jun 20, 2023
@joelguittet
Contributor Author

Hello

Having worked with the W5500 interface for several months now, I realize I now get some issues (my application is growing...).
And the reason is pretty simple, and it was probably the root cause of the ENC28J60 issue here: this kind of chip offloads the sockets, but it is not integrated in Zephyr using the offloading possibilities (see for example the eswifi or winc1500 drivers in Zephyr).

As a consequence, only a single socket can be used at a time.

The driver needs to be reworked to offload the sockets instead of just using the first socket in raw-access mode.

Joel

@jfischer-no
Collaborator

Having worked with the W5500 interface for several months now, I realize I now get some issues (my application is growing...).
And the reason is pretty simple, and it was probably the root cause of the ENC28J60 issue here: this kind of chip offloads the sockets, but it is not integrated in Zephyr using the offloading possibilities (see for example the eswifi or winc1500 drivers in Zephyr).

This is ingenious causal inference.

@github-actions

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed otherwise this issue will automatically be closed in 14 days. Note, that you can always re-open a closed issue at any time.

@github-actions github-actions bot added the Stale label Sep 26, 2023
@jhedberg
Member

@jfischer-no are you planning to work on this? (it's getting the Stale label auto-added)

@jfischer-no
Collaborator

@jfischer-no are you planning to work on this? (it's getting the Stale label auto-added)

#54199 (comment)
I do not think it is worth it.

@jfischer-no jfischer-no removed their assignment Sep 26, 2023
@jhedberg
Member

@jfischer-no are you planning to work on this? (it's getting the Stale label auto-added)

#54199 (comment) I do not think it is worth it.

Ok, let’s close this then

@jhedberg jhedberg closed this as not planned Sep 26, 2023