
Dead loop of the kernel during Bluetooth Mesh pressure communication #12726

Closed

xiaoliang314 opened this issue Jan 25, 2019 · 23 comments

Labels: area: Bluetooth, bug (The issue is a bug, or the PR is fixing a bug), priority: low (Low impact/importance bug)

xiaoliang314 commented Jan 25, 2019

Describe the bug
I created two Bluetooth Mesh nodes and put them in the same network. I use a serial console to control one of the nodes to send packets to the other node: the next packet is sent immediately after a reply is received from the other node, or after a 5 second timeout. After roughly 800 iterations of this test, the node stopped responding to my serial commands. I used gdb to check the current running state; the gdb session is shown below, after the reproduction steps.
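A rough sketch of the send loop being described (hypothetical helper and semaphore names; this is not the reporter's actual test code, and it assumes the reply handler gives a semaphore when the peer's response arrives):

/* Hypothetical sketch of the stress loop: send a packet, wait up to
 * 5 seconds for the peer's reply, then immediately send the next one.
 */
#include <zephyr.h>

extern void send_test_packet(u32_t seq);   /* hypothetical helper */

K_SEM_DEFINE(reply_sem, 0, 1);             /* given from the reply handler */

static void stress_loop(void)
{
	for (u32_t i = 0U; ; i++) {
		send_test_packet(i);
		/* move on as soon as the reply arrives, or after 5 s */
		(void)k_sem_take(&reply_sem, K_SECONDS(5));
	}
}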

To Reproduce
Steps to reproduce the behavior:

  1. cmake -DBOARD="nrf52_pca10040" ../..
  2. make
  3. See error
**gdb debug log**
(gdb) 
_priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:574
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
(gdb) l
569	{
570		struct k_thread *t;
571	
572		__ASSERT_NO_MSG(!_is_idle(thread));
573	
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
575			if (_is_t1_higher_prio_than_t2(thread, t)) {
576				sys_dlist_insert_before(pq, &t->base.qnode_dlist,
577							&thread->base.qnode_dlist);
578				return;
(gdb) l
579			}
580		}
581	
582		sys_dlist_append(pq, &thread->base.qnode_dlist);
583	}
584	
585	void _priq_dumb_remove(sys_dlist_t *pq, struct k_thread *thread)
586	{
587		__ASSERT_NO_MSG(!_is_idle(thread));
588	
(gdb) s
575			if (_is_t1_higher_prio_than_t2(thread, t)) {
(gdb) bt
#0  _priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:575
#1  0x0001ab06 in _add_thread_to_ready_q (thread=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:311
#2  0x0001b4f8 in z_clock_announce (ticks=<optimized out>)
    at /home/ubuntu/zephyr/kernel/timeout.c:194
#3  0x00004bc2 in _isr_wrapper ()
    at /home/ubuntu/zephyr/arch/arm/core/isr_wrapper.S:120
(gdb) s
_is_t1_higher_prio_than_t2 (t2=0x2000064c <adv_thread_data>, 
    t1=0x2000064c <adv_thread_data>) at /home/ubuntu/zephyr/kernel/sched.c:91
91		if (t1->base.prio < t2->base.prio) {
(gdb) p t2->base.prio
$1 = -9 '\367'
(gdb) p t1->base.prio
$2 = -9 '\367'
(gdb) s
_priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:574
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
(gdb) p pq
$3 = (sys_dlist_t *) 0x200019ac <_kernel+40>
(gdb) p pq->next 
$4 = (struct _dnode *) 0x20000b5c <prio_recv_thread_data>
(gdb) p pq->next->next
$5 = (struct _dnode *) 0x20000550 <tx_thread_data>
(gdb) p pq->next->next->next
$6 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p pq->next->next->next->next
$7 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p pq->next->next->next->next->next
$8 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p t
$9 = (struct k_thread *) 0x2000064c <adv_thread_data>
(gdb)   
xiaoliang314 added the bug label Jan 25, 2019
xiaoliang314 (Author)

My version is:
commit 2caecb5 (HEAD -> master, origin/master, origin/HEAD)
Author: Jan Van Winkel jan.van_winkel@dxplore.eu
Date: Sat Jan 19 19:38:50 2019 +0100

display: Remove stale ILI9340 header file

Remove stale ILI9340 header file from global includes.

Signed-off-by: Jan Van Winkel <jan.van_winkel@dxplore.eu>

jhedberg (Member)

I can't quite decipher your gdb logs. Seems like they're pointing at the scheduler?

galak added the priority: low label Jan 29, 2019
xiaoliang314 (Author)

When I pause with gdb, it stops in SYS_DLIST_FOR_EACH_CONTAINER, and the next pointer of the adv_thread list node points to itself, causing SYS_DLIST_FOR_EACH_CONTAINER to loop forever. This seems to happen when adv_thread calls k_sleep.
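To make the failure mode concrete, here is a minimal sketch of what the iteration reduces to (simplified from the sys_dlist API; not the actual kernel source):

/* Simplified illustration, not the real Zephyr code.  The container
 * iteration walks node->next until it reaches the list head again.  If a
 * node's next pointer has been corrupted to point at the node itself (as
 * in the gdb dump above, where adv_thread_data->next == adv_thread_data),
 * the head is never reached and the loop spins forever.
 */
struct dnode {
	struct dnode *next;
	struct dnode *prev;
};

static void walk(struct dnode *head)
{
	for (struct dnode *n = head->next; n != head; n = n->next) {
		/* visit the container of n; with n->next == n this body
		 * keeps running on the same node indefinitely
		 */
	}
}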

jhedberg (Member)

There shouldn't be anything wrong with a thread calling k_sleep, so seems like a possible scheduler bug (or something like that)?

nashif (Member) commented Jan 30, 2019

@andyross fyi

andyross (Contributor)

That's a corrupt list for sure. There actually was a race that could cause this, partially fixed in commit e664c78 (which I think you have, if I found that HEAD correctly). But just Monday @dcpleung found another spot where we had introduced a recursive spinlock (which on UP systems results in the lock being released early), which could plausibly cause this. I should have a fix for that one (and some robustness changes to detect the case when CONFIG_ASSERT is active, though I don't know if that would have saved you or not) up in the next day or two. We can hope.

xiaoliang314 (Author)

Thanks, looking forward to the fix.

magilus commented Feb 3, 2019

I might also be running into this problem (under heavy Mesh load) but am unfortunately not very familiar with gdb. I have also noticed next elements pointing back to the referring element itself, but in my case in wait_q -> waitq. The symptoms are that my application freezes (i.e. it does not send / process advertisement messages) but it does not print a kernel panic or anything similar on the serial console. Is it possible that I am hit by the same bug? @xiaoliang314 does this sound familiar to you?

I'm on ff88b7f. This happens on PCA10040 and also on PCA10059 (but I can't debug the latter due to a missing J-Link).


_pend_current_thread (key=0, wait_q=wait_q@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:409
409	{

l
404		_ready_thread(th);
405	}
406	#endif
407	
408	int _pend_current_thread(u32_t key, _wait_q_t *wait_q, s32_t timeout)
409	{
410	#if defined(CONFIG_TIMESLICING) && defined(CONFIG_SWAP_NONATOMIC)
411		pending_current = _current;
412	#endif
413		pend(_current, wait_q, timeout);
l
414		return _Swap(key);
415	}
416	
417	struct k_thread *_unpend_first_thread(_wait_q_t *wait_q)
418	{
419		struct k_thread *t = _unpend1_no_timeout(wait_q);
420	
421		if (t != NULL) {
422			(void)_abort_thread_timeout(t);
423		}
s
411		pending_current = _current;
bt
#0  _pend_current_thread (key=0, wait_q=wait_q@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:411
#1  0x00023c42 in _impl_k_sem_take (sem=sem@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=timeout@entry=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sem.c:151
#2  0x00022c8a in k_sem_take (timeout=-1, sem=0x20001d88 <entropy_nrf5_data+24>) at zephyr/include/generated/syscalls/kernel.h:103
#3  entropy_nrf5_get_entropy (device=<optimized out>, buf=0x2000656a <_interrupt_stack+1970> "", len=2) at C:/Users/ma/zephyrproject/zephyr/drivers/entropy/entropy_nrf5.c:240
#4  0x00004076 in _impl_entropy_get_entropy (length=4, buffer=0x20006568 <_interrupt_stack+1968> "\200\067", dev=<optimized out>) at ../../../../include/entropy.h:76
#5  entropy_get_entropy (length=4, buffer=0x20006568 <_interrupt_stack+1968> "\200\067", dev=<optimized out>) at zephyr/include/generated/syscalls/entropy.h:13
#6  sys_rand32_get () at C:/Users/ma/zephyrproject/zephyr/subsys/random/rand32_entropy_device.c:31
#7  0x000007ea in button_pressed_worker (work=<optimized out>) at ../src/main.c:567
#8  0x00024320 in _timer_expiration_handler (t=0x2000005c <heartbeat_timer>) at C:/Users/ma/zephyrproject/zephyr/kernel/timer.c:69
#9  0x0002425e in z_clock_announce (ticks=<optimized out>) at C:/Users/ma/zephyrproject/zephyr/kernel/timeout.c:190
#10 0x00005532 in _isr_wrapper () at C:/Users/ma/zephyrproject/zephyr/arch/arm/core\isr_wrapper.S:120
#11 <signal handler called>
#12 0x122003f6 in ?? ()
#13 0x1e325c5a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
s
413		pend(_current, wait_q, timeout);
s
pend (thread=0x20000edc <recv_thread_data>, wait_q=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:339
339	{
p wait_q
$2 = (_wait_q_t *) 0x20001d88 <entropy_nrf5_data+24>
p wait_q->waitq
$3 = {{head = 0x20000edc <recv_thread_data>, next = 0x20000edc <recv_thread_data>}, {tail = 0x20000edc <recv_thread_data>, prev = 0x20000edc <recv_thread_data>}}
p wait_q->waitq->next
$4 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next
$5 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next->next
$6 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next->next->next
$8 = (struct _dnode *) 0x20000edc <recv_thread_data>

xiaoliang314 (Author)

@pirast Yes, they may be caused by the same bug.

magilus commented Feb 15, 2019

Hi @xiaoliang314, thanks! I have seen some commits in master that may tackle the issue. However, I still seem to encounter it but am not able to reproduce it right now with a debugger attached (trying over the weekend). In case you have time: Could you cross-check? You seem to have a more appropriate test setup at hand.

xiaoliang314 (Author) commented Feb 18, 2019

@pirast @andyross I have updated the code to the latest version. It seems harder to reproduce now, but the problem is still there. I stress-tested nearly 50,000 iterations and it still hangs. I didn't inspect the state with gdb, but the symptom is the same: the uart console no longer produces any output.

xiaoliang314 (Author)

@pirast I modified the sample code so that it could be controlled by the host computer. I send the next packet as soon as the previous one has been sent successfully. My uart baud rate is 1M.

magilus commented Feb 21, 2019

I was able to reproduce it on latest master (c2d5e7b). This is the backtrace I could obtain with CONFIG_DEBUG=y.

After increasing the log buffer, I also get an exception message:

@andyross Is there any chance that you can come up with a solution? Is there any more information I can provide so that you can tackle this?

***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x20000a28
Faulting instruction address = 0x2b43c
Fatal fault in thread 0x20000a28! Aborting.
#0  _priq_dumb_add (thread=0x20000ef8 <adv_thread_data>, pq=0x20003460 <_kernel+40>) at /home/psmeshloc/zephyr/kernel/sched.c:609
        t = <optimized out>
        t = <optimized out>
#1  _add_thread_to_ready_q (thread=thread@entry=0x20000ef8 <adv_thread_data>) at /home/psmeshloc/zephyr/kernel/sched.c:334
        __i = <optimized out>
        __key = {key = 0}
#2  0x0002b644 in _ready_thread (thread=0x20000ef8 <adv_thread_data>) at /home/psmeshloc/zephyr/kernel/include/ksched.h:233
No locals.
#3  z_thread_timeout (to=0x20000f10 <adv_thread_data+24>) at /home/psmeshloc/zephyr/kernel/sched.c:427
        th = 0x20000ef8 <adv_thread_data>
#4  0x0002c388 in z_clock_announce (ticks=<optimized out>) at /home/psmeshloc/zephyr/kernel/timeout.c:189
        t = 0x20000f10 <adv_thread_data+24>
        dt = <optimized out>
        key = {key = 0}
#5  0x00008bb4 in rtc1_nrf_isr (arg=<optimized out>) at /home/psmeshloc/zephyr/drivers/timer/nrf_rtc_timer.c:86
        key = <optimized out>
        t = <optimized out>
        dticks = <optimized out>
        next = <optimized out>
#6  0x00009462 in _isr_wrapper () at /home/psmeshloc/zephyr/arch/arm/core/isr_wrapper.S:120
No locals.
#7  <signal handler called>
No symbol table info available.
#8  0x6c5c7d88 in ?? ()
No symbol table info available.
#9  0x4cc01f80 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

jhedberg (Member) commented Mar 7, 2019

I don't see how the scheduling corruption is Bluetooth related, so I'm removing myself from the reviewers - at the same time I don't know who the right person would be, so this will be unassigned for a while.

jhedberg removed their assignment Mar 7, 2019
laperie (Collaborator) commented Mar 7, 2019

@andyross, assigning to you based on #12726 (comment): please ask the reporter to re-verify once the fixes you are working on are merged.

andyross (Contributor) commented Mar 7, 2019

Is this still reproducible? Note that there have been several fixes to scheduler and locking code since the HEAD referenced above (dff6b71 seems most likely) which could produce symptoms exactly like that. Would be good to check again. Or maybe it was a single spurious failure? Has it happened since?

xiaoliang314 (Author)

@andyross I will update to the latest version and test. I have set aside at least 12 hours of test time to see whether the problem still occurs.

bdrlamb commented Mar 9, 2019

I have run into this same infinite loop recently. After doing some investigation, I believe in my case it is caused by k_spin_lock not masking all interrupts (Arch=STM32L1). It appears that both higher-priority interrupts (31 is what I tested with) and IRQ_ZERO_LATENCY interrupts do not get masked.

That can cause issues if an ISR interrupts _add_timeout or _remove_timeout, and the ISR itself calls k_poll_signal_raise or other kernel functions which call _add_timeout or _remove_timeout. In my case, _add_timeout was being interrupted, further calls to _add_timeout/_remove_timeout were made from the ISR, and this ultimately broke the linked list.
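As an illustration of how such interleaving can wreck the list (simplified sketch, not the actual _add_timeout code):

/* Sketch only.  A doubly linked insert is several pointer writes; if an
 * unmasked ISR fires in the middle and inserts into the same list, the
 * two half-finished updates leave the forward and backward links
 * inconsistent, and later operations on such a list can then produce
 * the kind of self-referencing next pointers seen in the gdb dumps above.
 */
struct node { struct node *next, *prev; };

static void insert_after(struct node *pos, struct node *n)
{
	n->next = pos->next;        /* (1) */
	n->prev = pos;              /* (2) */
	/* If an ISR inserts another node after pos right here, pos->next
	 * changes and no longer matches the value captured in (1).
	 */
	pos->next->prev = n;        /* (3) patches whatever pos->next is now */
	pos->next = n;              /* (4) the ISR's node is silently dropped
				     *     from the forward chain
				     */
}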

I'm not sure whether this counts as a bug, but I think it might be worthwhile to have k_spin_lock assert if it catches itself being re-entered. Or at least have the kernel timer code locally record whether the list is locked and bugcheck if it finds it has somehow become re-entrant - this was difficult to track down given how hard it is to reproduce. A conceptual sketch follows.
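A conceptual sketch of such a re-entrancy check, using made-up demo_* names (this is not the real k_spinlock implementation; the CONFIG_ASSERT-based validation that actually exists in k_spin_lock is mentioned in a later comment):

/* Conceptual sketch only.  The lock records whether it is already held
 * and asserts on a nested acquire or on an unlock of a free lock.
 */
#include <zephyr.h>

struct demo_spinlock {
	unsigned int held;               /* 0 = free, 1 = held (UP case) */
};

static inline unsigned int demo_lock(struct demo_spinlock *l)
{
	unsigned int key = irq_lock();

	__ASSERT(l->held == 0U, "recursive spinlock acquisition");
	l->held = 1U;
	return key;
}

static inline void demo_unlock(struct demo_spinlock *l, unsigned int key)
{
	__ASSERT(l->held == 1U, "unlock of a lock that is not held");
	l->held = 0U;
	irq_unlock(key);
}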

bdrlamb commented Mar 9, 2019

One other option may be to have a number of the kernel functions assert if they're ever called from an NMI.

xiaoliang314 (Author)

@bdrlamb Are you using the latest version?

andyross (Contributor)

@bdrlamb: The rule is that _arch_irq_lock() (the underlying primitive used by the spinlocks) must mask any interrupt used to inspect or modify Zephyr kernel state. Some levels are allowed to sit higher than that, but they have to be for driver-specific purposes and may only touch things that Zephyr never looks at. Some of the radio code on nRF5x works this way, as far as I know. The ZERO_LATENCY feature is deliberately placed at that priority.

And FWIW: when CONFIG_ASSERT is true, k_spin_lock() does indeed include a validation layer that will catch reentrant locking, and releases of unlocked locks.

andyross (Contributor)

I'm going to close this one. I'm really quite certain this is the issue resolved with commit dff6b71. Please reopen if the symptom reappears.

bdrlamb commented Mar 28, 2019

@xiaoliang314 I wasn't using the absolute latest version, but it definitely included commit dff6b71. The issue was my fault (calling kernel functions in NMIs) and I have not encountered it since.
