Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IP stack can't recover from a packet overload #7571

Closed
therealprof opened this issue May 16, 2018 · 56 comments

Comments

Projects
None yet
6 participants
@therealprof
Copy link
Contributor

commented May 16, 2018

I've connected my Nucleo-F429ZI to a reasonably busy network (i.e. a constant flow of broadcast messages which also end up at the network interface of the MCU) and it very soon ends up receiving more packets than it can handle but the main problem is that even if the load lightens the network application doesn't receive any packets anymore and renders the system useless.

After the first occurrence of this error message the application is dead:

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer

I tried increasing buffers and stack but to no avail. Not sure which information I can provide to aid in debugging but it is 100% reproducible (and hence very annoying ;) ):

shell> net stats

Global statistics
=================
IPv4 recv      579      sent    12      drop    569     forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    450
UDP chkerr     0
Bytes received 152255
Bytes sent     1237
Processing err 489
shell> net stats all

Interface 0x2000a140 (Ethernet) [0]
===================================
IPv4 recv      579      sent    12      drop    569     forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    450
UDP chkerr     0
Bytes received 152315
Bytes sent     1237
Processing err 490
shell> [net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer
[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer
[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():360: Failed to get free buffer

shell> net stats

Global statistics
=================
IPv4 recv      2954     sent    12      drop    2944    forwarded       0
IP vhlerr      0        hblener 0       lblener 0
IP fragerr     0        chkerr  0       protoer 0
ICMP recv      3        sent    5       drop    0
ICMP typeer    0        chkerr  0
UDP recv       7        sent    5       drop    2190
UDP chkerr     0
Bytes received 834305
Bytes sent     1237
Processing err 2716
@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 16, 2018

Can you include a copy of the .config for the app yiu have loaded? As well that the buffer setyings you've already tried raising?

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2018

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2018

I tried bumping all values I could find in the networking area without any noticeable effect but I'm happy to try specific changes you'd like me to try.

@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 16, 2018

The key value for RX buffers would be:
CONFIG_NET_BUF_RX_COUNT=10

Also, enable the following debugs:
CONFIG_NET_LOG=y
CONFIG_SYS_LOG_NET_LEVEL=4
CONFIG_NET_DEBUG_NET_PKT=y

Once added, you can use the following 2 commands for debugging in the console shell:

  1. net mem: displays the status of the net packets and buffers (allocated vs. freed)
  2. net allocs: shows a kind of stack trace for each packet/buffer indicating when it was last freed or if it's still allocated, where it was allocated.
@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 16, 2018

I think we're actually not freeing back up buffers on an error path, so they're staying allocated.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 16, 2018

I'll try that tomorrow in the office.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2018

I've put a full capture of the debugging log as well as the commands at the end here:

https://gist.github.com/therealprof/6a7263c6633b461be79ad270636b0468

Increase of the RX buffers made no difference.

@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 17, 2018

Wow, there are packets in your log which are running > 1KB data size. Is the eth port in promiscuous mode?

Can you run "net allocs" once you run out of buffers?

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2018

Wow, there are packets in your log which are running > 1KB data size. Is the eth port in promiscuous mode?

No, just a regular office network.

Can you run "net allocs" once you run out of buffers?

It's at the bottom of the log.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 17, 2018

I think we're actually not freeing back up buffers on an error path, so they're staying allocated.

I think you are right here. Increasing the number bufs does not really help but just postpones the inevitable out-of-bufs issue.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 17, 2018

Are you actively sending something to zephyr device, or is it mostly some multicast / broadcast traffic?
Can you check using wireshark what kind of traffic you have in the network, the log does not really reveal that with these debug options. I am just trying to pin point the code where to start debugging this.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 17, 2018

According to the log, you are not running out of buffers.

The "net allocs" command tells that all the buffers are freed just fine.
Also at the end of the log there is a print:
[net/pkt] [DBG] net_pkt_print: (0x20004358): TX 10 RX 50 RDATA 50 TDATA 10
which tells how many buffers are available and they are the same as how many were allocated at build time:
net_pkt_init: (0x20006060): Allocating 50 RX (3600 bytes), 10 TX (720 bytes), 50 RX data (0 bytes) and 10 TX data (0 bytes) buffers

Also according to the log the device never fully run out of buffers
[net/pkt] [DBG] net_pkt_print: (0x20004358): TX 10 RX 15 RDATA 1 TDATA 10
so there was always at least one net_buf buffer available (128 bytes) for incoming data. The RDATA or RX field would be 0 if we run out of buffers or have buffer leak.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 17, 2018

My bad, there was out-of-buffers situations in the log:

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer

but according to the log, the system recovered that successfully.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2018

Are you actively sending something to zephyr device, or is it mostly some multicast / broadcast traffic?
Can you check using wireshark what kind of traffic you have in the network, the log does not really reveal that with these debug options. I am just trying to pin point the code where to start debugging this.

Well, yes, I'm trying to use LwM2M but most of the traffic is actually broadcasts in the network.

[net/buf] [ERR] net_buf_alloc_len_debug: net_pkt_get_reserve_data():359: Failed to get free buffer
but according to the log, the system recovered that successfully.

No it does not. The application (lwm2m-client) doesn't receive any traffic any more after one of those messages. Actually earlier, but when this message appears you know it has happened and it's time for a reboot.

I am very convinced it is due to all the traffic in the network because the same device will run much longer in a different environment, e.g. my home. At the office it sometimes even fails to register at the lwm2m server but at the lastest 3 minutes after boot it is not communicating anymore.

@tbursztyka

This comment has been minimized.

Copy link
Collaborator

commented May 17, 2018

Hum RX are going lower and lower, and worse RDATA is a lot near 0 (thus why sometimes it is unable to get enough frags for the received packet). Looks like the host has not enough time to consume received packets. Sounds like RX queue has a too low priority maybe. @jukkar how can this be tweaked now? (we used to have one rx thread but now it's slightly different as we have rx queue(s). )

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

@tbursztyka

This comment has been minimized.

Copy link
Collaborator

commented May 17, 2018

Note that, even if rx queue priority could fix your issue, there is no buffer bloat mitigation logic in Zephyr's net stack. So it is still possible to flood it (knowing that it runs on a board with a limited amount of RAM with a bearer like Ethernet)

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2018

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

So the warnings disappear but the data flow still stops. Maybe there's some other foul play here.

@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 17, 2018

@tbursztyka I guess the core problem is that the ethernet driver can't know if the packet is wanted or unwanted until the fully formed(allocated) packet+buffers reaches the ip management / connection code ... which could certainly lead to flooding as you mentioned if the RX queue can't process the packets fast enough to free back up the net_pkt/net_bufs for more incoming packets.

Examining the ethernet case: could we send the "destination MAC" from the very first buffer to a new L2 (ethernet specific) function which could determine if the entire frame could be tossed away instead of allocating any more buffers.

The downside is that each ethernet driver would need to make that call prior to calling net_recv_data.

@tbursztyka

This comment has been minimized.

Copy link
Collaborator

commented May 17, 2018

@mike-scott What you describe is the promiscuous mode. Afaik most controllers, if not all in fact, filters packet on the dst mac: if it's not for them, they drop it. That said, if the feature is not enabled or badly configured on driver level, then that won't work. Could be the case here, I don't know.

@therealprof The frag size tweaking is just to use memory in a bit better way (less frags so less overhead of net_buf struct), but there is still the issue of rx packet consumption.

@mike-scott

This comment has been minimized.

Copy link
Contributor

commented May 17, 2018

@tbursztyka 1KB broadcast packets seem large. I guess we would need some debugging to make sure these packets are valid.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 18, 2018

No it does not. The application (lwm2m-client) doesn't receive any traffic any more after one of those messages. Actually earlier, but when this message appears you know it has happened and it's time for a reboot.

I meant to say that from memory point of view the system recovers this low memory issue just fine (according to provided logs). It is totally different matter what happens in other parts of the zephyr (be it in IP stack or application).

@jukkar

This comment has been minimized.

Copy link
Member

commented May 18, 2018

Also, frags are 128 bytes, but that's quite small for ethernet. Could you try raising CONFIG_NET_BUF_DATA value? to 512 for instance (you could reduce the number of CONFIG_NET_BUF_RX_COUNT at same time, you won't need that many anymore).

I would argue that the system should work better with 128 byte buffer than with 512. This depends on the network traffic of course but with smaller buffers the network packets are probably better fit into net_buf and we do not run out of buffers easier.

Note that we have a samples/net/throughput_server sample app for measuring UDP traffic throughput, you could try that one too and see what happens with your board. See the readme file in that sample for usage details.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 18, 2018

Note that, even if rx queue priority could fix your issue, there is no buffer bloat mitigation logic in Zephyr's net stack. So it is still possible to flood it (knowing that it runs on a board with a limited amount of RAM with a bearer like Ethernet)

We cannot prevent flooding but we should recover from it of course. For testing this one can use the throughput_server app I mentioned earlier.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 18, 2018

In order to get some idea what goes wrong and in what part of the system, could you enable some debugging options and post the log somewhere.
We could try first with

CONFIG_NET_LOG_GLOBAL=y
CONFIG_SYS_LOG_LWM2M_LEVEL=4
CONFIG_SYS_LOG_NET_LEVEL=4

That setting will print lot of stuff and will require lot of flash so we might need to tweak that a bit but we can start with this one.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented May 18, 2018

@jukkar

No worries,

Memory region         Used Size  Region Size  %age Used
           FLASH:       98244 B         2 MB      4.68%
            SRAM:       41152 B       256 KB     15.70%
        IDT_LIST:         244 B         2 KB     11.91%

However I can only test next Wednesday (if I don't forget), won't be at the office before.

@jukkar

This comment has been minimized.

Copy link
Member

commented May 18, 2018

Was that for lwm2m_client? When I compile for nucleo_f429zi I get

Memory region         Used Size  Region Size  %age Used
           FLASH:      163492 B         2 MB      7.80%
            SRAM:       46336 B       256 KB     17.68%
        IDT_LIST:         244 B         2 KB     11.91%

Only thing I added to prj.conf file in that sample was the CONFIG_NET_LOG_GLOBAL=y

jukkar added a commit to jukkar/zephyr that referenced this issue Aug 14, 2018

net: l2/lib: Always have a timeout when allocating a net_buf
Instead of waiting forever for a network buffer, have a timeout
when allocating net_buf. This way we cannot left hanging for a
long time waiting for a buffer and possibly deadlock the system.
This commit adds checks to L2 and network support libraries.

Fixes zephyrproject-rtos#7571

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

@jukkar jukkar closed this in 7c7cfdd Aug 14, 2018

jukkar added a commit that referenced this issue Aug 14, 2018

net: l2/lib: Always have a timeout when allocating a net_buf
Instead of waiting forever for a network buffer, have a timeout
when allocating net_buf. This way we cannot left hanging for a
long time waiting for a buffer and possibly deadlock the system.
This commit adds checks to L2 and network support libraries.

Fixes #7571

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 14, 2018

It's not proven that 7c7cfdd fixes this, reopening.

@pfalcon pfalcon reopened this Aug 14, 2018

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 21, 2018

@therealprof: Are you back from vacation now? Can you please retest this to see if there're any improvements? (With my usual tests, I don't see any so far, so we should keep pulling on this thing.)

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 22, 2018

@pfalcon Yes I am. Unfortunately I don't have the Nucleo-F429ZI here with me in the "hostile" environment and the FRDM-K64F build is broken, too. :(

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 22, 2018

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 22, 2018

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 22, 2018

@therealprof, Btw, if you still experience issue, I recommend applying and enabling (CONFIG_SYS_LOG_DBG_ERR=y) my cute patch: #8769 , and see if there're any correlation with the state it gets into and logging printed.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2018

@pfalcon I tried on my FRDM_K64F in the office today (aka hostile networking environment) and it was running extremely unstable (lots of buffer problems and faults: "Imprecise data bus error") though I'm not convinced it is the same issue as before. I guess I need to crank up the logging and try again on Monday.

@nashif nashif modified the milestone: v1.13.0 Aug 26, 2018

@nashif nashif added area: LWM2M and removed area: LWM2M labels Aug 26, 2018

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

@therealprof

it was running extremely unstable (lots of buffer problems and faults: "Imprecise data bus error")

Can you please give more details on how to reproduce it? I assume it's still lwm2m_client sample. I have neither busy Ethernet framework nor even LwM2M server here, and I don't get faults running from today's master. Perhaps, the faults are exactly dependent on the network traffic you have? Any clarifications/step to reproduce would be appreciated.

***** Booting Zephyr OS v1.12.0-1654-g3b80998ff2 *****
Zephyr Shell, Zephyr version: 1.13.0-rc1
Type 'help' for a list of available commands
shell> [lib/lwm2m_engine] [DBG] lwm2m_engine_init: LWM2M engine thread started
[lwm2m_obj_security] [DBG] security_create: Create LWM2M security instance: 0
[lwm2m_obj_server] [DBG] server_create: Create LWM2M server instance: 0
[lwm2m_obj_device] [DBG] device_create: Create LWM2M device instance: 0
[lwm2m_obj_firmware] [DBG] firmware_create: Create LWM2M firmware instance: 0
[lwm2m-client] [INF] main: Run LWM2M client
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/9, value:0x20007207, len:1
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/10, value:0x20007204, len:4
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/20, value:0x20007207, len:1
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3/0/21, value:0x20007204, len:4
[lib/lwm2m_engine] [DBG] lwm2m_engine_create_obj_inst: path:3303/0
[ipso_temp_sensor] [DBG] temp_sensor_create: Create IPSO Temperature Sensor instance: 0
[lib/lwm2m_engine] [DBG] lwm2m_engine_set: path:3303/0/5700, value:0x20007220, len:8
[lib/lwm2m_engine] [DBG] lwm2m_engine_create_obj_inst: path:3311/0
[ipso_light_control] [DBG] light_control_create: Create IPSO Light Control instance: 0
[lib/lwm2m_rd_client] [INF] lwm2m_rd_client_start: LWM2M Client: frdm_k64f
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 0
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954

shell> [lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
[lib/lwm2m_rd_client] [WRN] do_registration_timeout_cb: Registration Timeout
[lwm2m-client] [DBG] rd_client_event: Disconnected
[lwm2m-client] [DBG] rd_client_event: Registration failure!
[lib/lwm2m_rd_client] [INF] sm_do_init: RD Client started with endpoint 'frdm_k64f' and client lifetime 30
[lib/lwm2m_rd_client] [DBG] sm_send_registration: registration sent [2001:db8::2]
[lib/lwm2m_engine] [DBG] retransmit_request: Resending message: 0x20002954
@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2018

@pfalcon The new crashes I've been seeing seem to be mostly unrelated to general network traffic, at least a Nucleo board will continue to run for a whole day just fine. However the FRDM will reliably crash with all the time and I can easily speed it up with LwM2M operations:

[net/core] [DBG] net_rx: (0x200021c4): Received pkt 0x20004a60 len 60
[net/ethernet] [DBG] ethernet_recv: (0x200021c4): iface 0x20009240 src 42:DA:4D:B4:4D:49 dst 00:04:9F:44:94:AA type 0x800 len 60
[net/ipv4] [DBG] net_ipv4_process_pkt: (0x200021c4): IPv4 packet received from 212.18.24.70 to 10.0.0.31
[net/conn] [DBG] net_conn_input: (0x200021c4): Check UDP listener for pkt 0x20004a60 src port 5683 dst port 65052 family 2 chksum 0xb305 len 36
[net/conn] [DBG] net_conn_input: (0x200021c4): [1] match found cb 0x00006eb9 ud 0x20000000 rank 0x27
[net/ctx] [DBG] net_context_packet_received: (0x200021c4): Set appdata 0x20005222 to len 8 (total 36)
[lib/lwm2m_engine] [DBG] lwm2m_udp_receive: checking for reply from [212.18.24.70]
[lib/lwm2m_engine] [DBG] handle_request: No accept option given. Assume OMA TLV.
[net/ethernet] [DBG] net_eth_fill_header: (0x200021c4): iface 0x20009240 src 00:04:9F:44:94:AA dst 42:DA:4D:B4:4D:49 type 0x800 len 180
[net/ethernet] [DBG] ethernet_send: (0x200021c4): Found ARP entry, sending pkt 0x20004dc0 to iface 0x20009240
***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x200021c4
Faulting instruction address = 0x15210
Fatal fault in thread 0x200021c4! Aborting.
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004aa8 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004af0 len 60
***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x20002610
Faulting instruction address = 0x5586
Fatal fault in thread 0x20002610! Aborting.
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004a18 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200049d0 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004988 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004940 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200048f8 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x200048b0 len 60
[net/core] [DBG] net_recv_data: (0x20003e68): prio 0 iface 0x20009240 pkt 0x20004868 len 250

Note that there's a few seconds delay before the last aborting then the net_recv_data output.

0x15210 is sys_dlist_remove
0x5586 is in memcpy

It is a bit nasty to debug in normal mode, I'm only getting:

Breakpoint 1, _BusFault (fromHardFault=0, esf=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:291
291	static int _BusFault(NANO_ESF *esf, int fromHardFault)
(gdb) bt
#0  _BusFault (fromHardFault=0, esf=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:291
#1  0x00004e22 in _FaultHandle (esf=0x20006bb8 <rx_stack+800>, fault=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:559
#2  _Fault (esf=0x20006bb8 <rx_stack+800>, exc_return=<optimized out>) at /Users/egger/oss/zephyr/arch/arm/core/fault.c:719
#3  0x00004f7a in __usage_fault () at /Users/egger/oss/zephyr/arch/arm/core/fault_s.S:143
#4  <signal handler called>
#5  0xaaaaaaaa in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

And of course once I set options for "better debugging experience" and fire it up in the debugger it chucks along just fine.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2018

I'm running this on 3b80998, BTW.

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

I'm running 3b80998 too, as can be seen from log above. But for me, memcpy is at 0x5836. I wonder if going for that "reproducible build" stuff would be actually useful for us. You run SDK 0.9.3, don't you?

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

(Just had my laptop battery crash, so will write many small msgs.)

I also see that faults happen with real LwM2M interaction, I don't have setup to reproduce it. But such kinds of faults look suspiciously stack-related, so I'd suggest bumping stack size, and maybe not just main, but other too.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2018

You run SDK 0.9.3, don't you?

Nope, never have, never will unless you're going to release it for macOS...

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

Ok, so we're in a typical situation of non-reproducibility, on multiple levels ;-).

I guess going as far as exchanging binaries won't give much good, so we should just assume that this issue is not fixed for 1.13, and keep it open. When you'll be able to test your original scenario with Nucleo-F429ZI, please post the results. I'll be approach the same issue from my side (e.g. #3132, #7831).

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

I guess this will need lowering of priority to not serve as a release blocker.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2018

@pfalcon Oh sorry, as mentioned above my Nucleo seems to run stable now in a heavy network.

@pfalcon

This comment has been minimized.

Copy link
Collaborator

commented Aug 28, 2018

at least a Nucleo board will continue to run for a whole day just fine

Oops, might have read that by diagonal. Well, great news then, @jukkar's efforts weren't in vain! I can only suggest to close this ticket then, other issues like frdm_k64f faults can be pursued elsewhere.

@therealprof

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2018

Sure, works for me. You might want to remove the LwM2M label too...

@pfalcon pfalcon removed the area: LWM2M label Aug 28, 2018

@pfalcon pfalcon closed this Aug 28, 2018

dleach02 added a commit to dleach02/zephyr that referenced this issue Sep 10, 2018

net: core: Always have a timeout when allocating a net_buf
Instead of waiting forever for a network buffer, have a timeout
when allocating net_buf. This way we cannot left hanging for a
long time waiting for a buffer and possibly deadlock the system.
This commit only adds checks to core IP stack in subsys/net/ip

Fixes zephyrproject-rtos#7571

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>

dleach02 added a commit to dleach02/zephyr that referenced this issue Sep 10, 2018

net: l2/lib: Always have a timeout when allocating a net_buf
Instead of waiting forever for a network buffer, have a timeout
when allocating net_buf. This way we cannot left hanging for a
long time waiting for a buffer and possibly deadlock the system.
This commit adds checks to L2 and network support libraries.

Fixes zephyrproject-rtos#7571

Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.