Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: drivers: ethernet: build as static library breaks frdm_k64f gptp sample application #38571

Closed
ainguraXmarquiegui opened this issue Sep 15, 2021 · 12 comments · Fixed by #38618
Assignees
Labels
area: Ethernet bug The issue is a bug, or the PR is fixing a bug platform: NXP NXP Regression Something, which was working, does not anymore
Milestone

Comments

@ainguraXmarquiegui
Copy link
Contributor

Describe the bug
Commit e912a05 breaks frdm_k64f gptp sample application. Any build from this version or newer results in the demo not working. GPTP no longer completes initialization.
Bulding code on revision b6855b2 or older results in a working demo.

Latest main code also works if I revert that commit, as can be seen here

To Reproduce
Steps to reproduce the behavior:

git checkout main # Tested on revision 79952ad774caf380fca4d24cd12f397b9c337d29
west update
west -v build -p -b frdm_k64f samples/net/gptp
sudo west flash
# replace enp0s31f6 with your own hardware timestamping supporting nic card interface name
sudo linuxptp/ptp4l -m -f 007-gPTP-PCMasterZephyrSlaveMinChange.cfg -i enp0s31f6

007-gPTP-PCMasterZephyrSlaveMinChange.cfg.txt

Board stays like this forever:

uart:~$ net gptp

SiteSyncSync state machine variables:
        Current state                  : INITIALIZING
        A PortSyncSync struct is ready : no
ClockSlaveSync state machine variables:
        Current state                  : SEND_SYNC_IND
        A PortSyncSync struct is ready : no
        The local clock has expired    : no
PortRoleSelection state machine variables:
        Current state                  : ROLE_SELECTION
ClockMasterSyncReceive state machine variables:
        Current state                  : WAITING
        A ClockSourceTime              : yes
        The local clock has expired    : no

Expected behavior
If I repeat those steps with this code point everything works as expected:

uart:~$ net gptp
Port Interface          Role
 1   0x20000f38 [1]     SLAVE

SiteSyncSync state machine variables:
        Current state                  : RECEIVING_SYNC
        A PortSyncSync struct is ready : no
ClockSlaveSync state machine variables:
        Current state                  : SEND_SYNC_IND
        A PortSyncSync struct is ready : no
        The local clock has expired    : no
PortRoleSelection state machine variables:
        Current state                  : ROLE_SELECTION
ClockMasterSyncReceive state machine variables:
        Current state                  : WAITING
        A ClockSourceTime              : yes
        The local clock has expired    : no

Impact
Big annoyance. The reference board for GPTP tests stops working.

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Toolchain: zephyr-sdk-0.13.0 + cmake version 3.20.2
  • Commit SHA: 79952ad
  • NIC on GPTP master PC: Intel I219-LM
@ainguraXmarquiegui ainguraXmarquiegui added the bug The issue is a bug, or the PR is fixing a bug label Sep 15, 2021
@ainguraXmarquiegui
Copy link
Contributor Author

@carlescufi carlescufi added Regression Something, which was working, does not anymore area: Ethernet labels Sep 15, 2021
@dcpleung
Copy link
Member

This is probably due to initialization priority of the drivers and subsys. Could you check to make sure everything required for GPTP to work are initialized in order? For example, make sure that init priorities are not the same for two drivers where one needs the other to work correctly.

@ainguraXmarquiegui
Copy link
Contributor Author

I will try to look into it anyway, but to be honest, I don't know where I can look for that. Any hint on what file/function I should check, or documentation that could help me understand your observation better would be helpful.

I'll report back when I have time to look into it.

@dcpleung
Copy link
Member

Drivers use DEVICE_DEFINE() to specify when they need to be initialized, and subsys uses SYS_INIT(). So these should be good starting point.

@hakehuang hakehuang added this to the v2.7.0 milestone Sep 16, 2021
@hakehuang
Copy link
Collaborator

hakehuang commented Sep 16, 2021

@ainguraXmarquiegui Thanks a lot for finding this issue.

@dcpleung I find change the init level does not help I move the ETH_INIT_PRIORITY to 95 and CONFIG_NET_INIT_PRIO is 90, but it still fails.

if I revert your commit then the pass log shows:

uart:~$ *** Booting Zephyr OS build v2.7.0-rc1-130-g0befa04f607b  ***


[00:00:00.005,000] <wrn> net_if: iface 0x20000f38 is down
[00:00:00.008,000] <inf> net_config: Initializing network
[00:00:00.008,000] <inf> net_config: Waiting interface 1 (0x20000f38) to be up...
[00:00:01.007,000] <wrn> net_gptp: Reset Pdelay requests
[00:00:01.009,000] <wrn> net_if: iface 0x20000f38 is down
[00:00:02.011,000] <wrn> net_gptp: Reset Pdelay requests
[00:00:02.013,000] <wrn> net_if: iface 0x20000f38 is down
[00:00:03.004,000] <inf> net_config: Interface 1 (0x20000f38) coming up
[00:00:03.004,000] <inf> eth_mcux: ETH_0 enabled 100M full-duplex mode.
[00:00:03.004,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:03.015,000] <wrn> net_gptp: Reset Pdelay requests
[00:00:03.105,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:00:04.018,000] <wrn> net_gptp: Reset Pdelay requests
[00:00:05.743,000] <dbg> net_gptp_sample.gptp_phase_dis_cb: GM D4:BE:D9:FF:FE:0C:72:BB last phase 0.78567836746728

whereas the failure log shows

[00:00:00.007,000] <inf> net_config: Initializing network
[00:00:00.007,000] <inf> net_config: Waiting interface 1 (0x20000f38) to be up...
[00:00:03.003,000] <inf> net_config: Interface 1 (0x20000f38) coming up
[00:00:03.003,000] <inf> eth_mcux: ETH_0 enabled 100M full-duplex mode.
[00:00:03.003,000] <inf> net_config: IPv4 address: 192.0.2.1
[00:00:03.104,000] <inf> net_config: IPv6 address: 2001:db8::1
[00:01:13.011,000] <inf> eth_mcux: ETH_0 link down
[00:01:16.011,000] <inf> eth_mcux: ETH_0 enabled 100M full-duplex mode.
[00:01:16.111,000] <inf> net_config: IPv6 address: 2001:db8::1

if build out the eth drivers below function will have problem

static inline void init_iface(struct net_if *iface)
{
	const struct net_if_api *api = net_if_get_device(iface)->api;

	if (!api || !api->init) {
		NET_ERR("Iface %p driver API init NULL", iface);
		return;
	}
...
}

image

@dcpleung
Copy link
Member

I don't have the board so I cannot debug it on hardware. But from what I can see in the built binary using main, the net_if_area and net_if_dev_area sections are populated. I can't see why net gptp in shell showed no interfaces at all. Is there a command to show all network interfaces? Could you try that?

@dleach02
Copy link
Member

dleach02 commented Sep 16, 2021

I have the board and I can see what the problem is. The is an implied dependency between two devices being brought up in POST_KERNEL. As @dcpleung indicated, they have the same priority level within the POST_KERNEL so the order their init functions get called can not be depended on.

The two devices are net_core.c:net_init() and the ptp clock mux in eth_mux.c. PTP clock needs to be initialized before the net_init occurs. The net_init flow fails to add the PTP interface because the clock hasn't been assigned yet.

I'm prepping a PR.

dleach02 added a commit to nxp-zephyr/zephyr that referenced this issue Sep 16, 2021
The net_core device initialization has a subtle dependency
on the PTP clock initialization. Adding a Kconfig and set
it to a priority level less than net_core. This will ensure
the initialization sequence.

Fixes zephyrproject-rtos#38571

Signed-off-by: David Leach <david.leach@nxp.com>
@dleach02
Copy link
Member

@ainguraXmarquiegui I posted a PR. Can you verify in your environment?

@ainguraXmarquiegui
Copy link
Contributor Author

I will have access to the board on Monday. I will test your PR that same day, if that's OK.

@hakehuang
Copy link
Collaborator

the fix works on my frdm_k64f boards, but there are two questions:

  1. usually linker place the init fucntion in alpha-beta order, so enet is always in ahead of net, however, when ethnet driver build as library this rules does not apply.

  2. Now the init order is ptp timer first -> net core -> eth driver. which means the eth driver eth_iface_init is called before eth_init. it looks ok for now, but not sure whether there will be a potential issue

carlescufi pushed a commit that referenced this issue Sep 17, 2021
The net_core device initialization has a subtle dependency
on the PTP clock initialization. Adding a Kconfig and set
it to a priority level less than net_core. This will ensure
the initialization sequence.

Fixes #38571

Signed-off-by: David Leach <david.leach@nxp.com>
@dleach02
Copy link
Member

@hakehuang, The order is by section name where the individual device inits are assigned a section based on run-level and priority. If multiple devices have the same run-level and priority it is explicitly undefined which order those will occur. We have seen multiple instances where some change with the build system causes the linker to change the order of these.

@ainguraXmarquiegui
Copy link
Contributor Author

@ainguraXmarquiegui I posted a PR. Can you verify in your environment?

You've already had confirmation from others, but just to keep my promise I have verified it on my own hardware. It seems to work. Thank you for the fix!

*** Booting Zephyr OS build v2.7.0-rc2-12-g9fa1f6edcace  ***                                                                   
                                                                                                                                       
                                                                                                                                       
[00:00:00.005,000] <wrn> net_if: iface 0x20000f38 is down                                                                              
[00:00:00.008,000] <inf> net_config: Initializing network                                                                              
[00:00:00.008,000] <inf> net_config: Waiting interface 1 (0x20000f38) to be up...                                                      
[00:00:01.007,000] <wrn> net_gptp: Reset Pdelay requests                                                                               
[00:00:01.009,000] <wrn> net_if: iface 0x20000f38 is down                                                                              
[00:00:02.011,000] <wrn> net_gptp: Reset Pdelay requests                                                                               
[00:00:02.013,000] <wrn> net_if: iface 0x20000f38 is down                                                                              
[00:00:03.002,000] <inf> net_config: Interface 1 (0x20000f38) coming up                                                                
[00:00:03.002,000] <inf> eth_mcux: ETH_0 enabled 100M full-duplex mode.                                                                
[00:00:03.003,000] <inf> net_config: IPv4 address: 192.0.2.1                                                                           
[00:00:03.014,000] <wrn> net_gptp: Reset Pdelay requests                                                                               
[00:00:03.103,000] <inf> net_config: IPv6 address: 2001:db8::1                                                                         
[00:00:05.887,000] <dbg> net_gptp_sample.gptp_phase_dis_cb: GM B0:5C:DA:FF:FE:31:16:26 last phase 0.78585016615912                     
uart:~$ net gptp                                                                                                                       
Port Interface          Role                                                                                                           
 1   0x20000f38 [1]     SLAVE                                                                                                          
                                                                                                                                       
SiteSyncSync state machine variables:                                                                                                  
        Current state                  : RECEIVING_SYNC                                                                                
        A PortSyncSync struct is ready : no                                                                                            
ClockSlaveSync state machine variables:                                                                                                
        Current state                  : SEND_SYNC_IND                                                                                 
        A PortSyncSync struct is ready : no                                                                                            
        The local clock has expired    : no                                                                                            
PortRoleSelection state machine variables:                                                                                             
        Current state                  : ROLE_SELECTION                                                                                
ClockMasterSyncReceive state machine variables:                                                                                        
        Current state                  : WAITING                                                                                       
        A ClockSourceTime              : yes                                                                                           
        The local clock has expired    : no                                                                                            
uart:~$ 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Ethernet bug The issue is a bug, or the PR is fixing a bug platform: NXP NXP Regression Something, which was working, does not anymore
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants