
Dead loop of the kernel during Bluetooth Mesh pressure communication #12726

Closed

xiaoliang314 opened this issue Jan 25, 2019 · 23 comments

Labels: area: Bluetooth, bug (The issue is a bug, or the PR is fixing a bug), priority: low (Low impact/importance bug)

xiaoliang314 commented Jan 25, 2019

Describe the bug
I created two Bluetooth Mesh nodes and put them in the same network. I use a serial console to control one of the nodes to send packets to the other node: the next packet is sent immediately after a reply is received from the other node, or after a 5 second timeout. After roughly 800 iterations of this test, the node stopped responding to my serial commands. I used gdb to check the current running state; the gdb session is shown below, after the reproduction steps.
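A rough sketch of the send loop being described (hypothetical helper and semaphore names; this is not the reporter's actual test code, and it assumes the reply handler gives a semaphore when the peer's response arrives):

/* Hypothetical sketch of the stress loop: send a packet, wait up to
 * 5 seconds for the peer's reply, then immediately send the next one.
 */
#include <zephyr.h>

extern void send_test_packet(u32_t seq);   /* hypothetical helper */

K_SEM_DEFINE(reply_sem, 0, 1);             /* given from the reply handler */

static void stress_loop(void)
{
	for (u32_t i = 0U; ; i++) {
		send_test_packet(i);
		/* move on as soon as the reply arrives, or after 5 s */
		(void)k_sem_take(&reply_sem, K_SECONDS(5));
	}
}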

To Reproduce
Steps to reproduce the behavior:

  1. cmake -DBOARD="nrf52_pca10040" ../..
  2. make
  3. See error
**gdb debug log**
(gdb) 
_priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:574
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
(gdb) l
569	{
570		struct k_thread *t;
571	
572		__ASSERT_NO_MSG(!_is_idle(thread));
573	
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
575			if (_is_t1_higher_prio_than_t2(thread, t)) {
576				sys_dlist_insert_before(pq, &t->base.qnode_dlist,
577							&thread->base.qnode_dlist);
578				return;
(gdb) l
579			}
580		}
581	
582		sys_dlist_append(pq, &thread->base.qnode_dlist);
583	}
584	
585	void _priq_dumb_remove(sys_dlist_t *pq, struct k_thread *thread)
586	{
587		__ASSERT_NO_MSG(!_is_idle(thread));
588	
(gdb) s
575			if (_is_t1_higher_prio_than_t2(thread, t)) {
(gdb) bt
#0  _priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:575
#1  0x0001ab06 in _add_thread_to_ready_q (thread=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:311
#2  0x0001b4f8 in z_clock_announce (ticks=<optimized out>)
    at /home/ubuntu/zephyr/kernel/timeout.c:194
#3  0x00004bc2 in _isr_wrapper ()
    at /home/ubuntu/zephyr/arch/arm/core/isr_wrapper.S:120
(gdb) s
_is_t1_higher_prio_than_t2 (t2=0x2000064c <adv_thread_data>, 
    t1=0x2000064c <adv_thread_data>) at /home/ubuntu/zephyr/kernel/sched.c:91
91		if (t1->base.prio < t2->base.prio) {
(gdb) p t2->base.prio
$1 = -9 '\367'
(gdb) p t1->base.prio
$2 = -9 '\367'
(gdb) s
_priq_dumb_add (pq=pq@entry=0x200019ac <_kernel+40>, 
    thread=thread@entry=0x2000064c <adv_thread_data>)
    at /home/ubuntu/zephyr/kernel/sched.c:574
574		SYS_DLIST_FOR_EACH_CONTAINER(pq, t, base.qnode_dlist) {
(gdb) p pq
$3 = (sys_dlist_t *) 0x200019ac <_kernel+40>
(gdb) p pq->next 
$4 = (struct _dnode *) 0x20000b5c <prio_recv_thread_data>
(gdb) p pq->next->next
$5 = (struct _dnode *) 0x20000550 <tx_thread_data>
(gdb) p pq->next->next->next
$6 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p pq->next->next->next->next
$7 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p pq->next->next->next->next->next
$8 = (struct _dnode *) 0x2000064c <adv_thread_data>
(gdb) p t
$9 = (struct k_thread *) 0x2000064c <adv_thread_data>
(gdb)   
xiaoliang314 added the bug label Jan 25, 2019
xiaoliang314 (Author)

My version is:
commit 2caecb5 (HEAD -> master, origin/master, origin/HEAD)
Author: Jan Van Winkel jan.van_winkel@dxplore.eu
Date: Sat Jan 19 19:38:50 2019 +0100

display: Remove stale ILI9340 header file

Remove stale ILI9340 header file from global includes.

Signed-off-by: Jan Van Winkel <jan.van_winkel@dxplore.eu>

jhedberg (Member)

I can't quite decipher your gdb logs. Seems like they're pointing at the scheduler?

galak added the priority: low label Jan 29, 2019
xiaoliang314 (Author)

When I pause with gdb, it stops in SYS_DLIST_FOR_EACH_CONTAINER, and the next pointer of the adv_thread list node points to itself, causing SYS_DLIST_FOR_EACH_CONTAINER to loop forever. This seems to happen when adv_thread calls k_sleep.
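To make the failure mode concrete, here is a minimal sketch of what the iteration reduces to (simplified from the sys_dlist API; not the actual kernel source):

/* Simplified illustration, not the real Zephyr code.  The container
 * iteration walks node->next until it reaches the list head again.  If a
 * node's next pointer has been corrupted to point at the node itself (as
 * in the gdb dump above, where adv_thread_data->next == adv_thread_data),
 * the head is never reached and the loop spins forever.
 */
struct dnode {
	struct dnode *next;
	struct dnode *prev;
};

static void walk(struct dnode *head)
{
	for (struct dnode *n = head->next; n != head; n = n->next) {
		/* visit the container of n; with n->next == n this body
		 * keeps running on the same node indefinitely
		 */
	}
}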

jhedberg (Member)

There shouldn't be anything wrong with a thread calling k_sleep, so seems like a possible scheduler bug (or something like that)?

nashif (Member) commented Jan 30, 2019

@andyross fyi

andyross (Contributor)

That's a corrupt list for sure. There actually was a race that could cause this, partially fixed in commit e664c78 (which I think you have, if I found that HEAD correctly). But just Monday @dcpleung found another spot where we had introduced a recursive spinlock (which on UP systems results in the lock being released early), which could plausibly cause this. I should have a fix for that one (and some robustness changes to detect the case when CONFIG_ASSERT is active, though I don't know if that would have saved you or not) up in the next day or two. We can hope.

xiaoliang314 (Author)

Thanks, looking forward to the fix.

magilus commented Feb 3, 2019

I might also be running into this problem (under heavy Mesh load) but am unfortunately not very familiar with gdb. I have also noticed next elements pointing back to the referring element itself, but in my case in wait_q -> waitq. The symptoms are that my application freezes (i.e. it does not send / process advertisement messages) but it does not print a kernel panic or anything similar on the serial console. Is it possible that I am hit by the same bug? @xiaoliang314 does this sound familiar to you?

I'm on ff88b7f. This happens on PCA10040 and also on PCA10059 (but I can't debug the latter due to a missing J-Link).


_pend_current_thread (key=0, wait_q=wait_q@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:409
409	{

l
404		_ready_thread(th);
405	}
406	#endif
407	
408	int _pend_current_thread(u32_t key, _wait_q_t *wait_q, s32_t timeout)
409	{
410	#if defined(CONFIG_TIMESLICING) && defined(CONFIG_SWAP_NONATOMIC)
411		pending_current = _current;
412	#endif
413		pend(_current, wait_q, timeout);
l
414		return _Swap(key);
415	}
416	
417	struct k_thread *_unpend_first_thread(_wait_q_t *wait_q)
418	{
419		struct k_thread *t = _unpend1_no_timeout(wait_q);
420	
421		if (t != NULL) {
422			(void)_abort_thread_timeout(t);
423		}
s
411		pending_current = _current;
bt
#0  _pend_current_thread (key=0, wait_q=wait_q@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:411
#1  0x00023c42 in _impl_k_sem_take (sem=sem@entry=0x20001d88 <entropy_nrf5_data+24>, timeout=timeout@entry=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sem.c:151
#2  0x00022c8a in k_sem_take (timeout=-1, sem=0x20001d88 <entropy_nrf5_data+24>) at zephyr/include/generated/syscalls/kernel.h:103
#3  entropy_nrf5_get_entropy (device=<optimized out>, buf=0x2000656a <_interrupt_stack+1970> "", len=2) at C:/Users/ma/zephyrproject/zephyr/drivers/entropy/entropy_nrf5.c:240
#4  0x00004076 in _impl_entropy_get_entropy (length=4, buffer=0x20006568 <_interrupt_stack+1968> "\200\067", dev=<optimized out>) at ../../../../include/entropy.h:76
#5  entropy_get_entropy (length=4, buffer=0x20006568 <_interrupt_stack+1968> "\200\067", dev=<optimized out>) at zephyr/include/generated/syscalls/entropy.h:13
#6  sys_rand32_get () at C:/Users/ma/zephyrproject/zephyr/subsys/random/rand32_entropy_device.c:31
#7  0x000007ea in button_pressed_worker (work=<optimized out>) at ../src/main.c:567
#8  0x00024320 in _timer_expiration_handler (t=0x2000005c <heartbeat_timer>) at C:/Users/ma/zephyrproject/zephyr/kernel/timer.c:69
#9  0x0002425e in z_clock_announce (ticks=<optimized out>) at C:/Users/ma/zephyrproject/zephyr/kernel/timeout.c:190
#10 0x00005532 in _isr_wrapper () at C:/Users/ma/zephyrproject/zephyr/arch/arm/core\isr_wrapper.S:120
#11 <signal handler called>
#12 0x122003f6 in ?? ()
#13 0x1e325c5a in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
s
413		pend(_current, wait_q, timeout);
s
pend (thread=0x20000edc <recv_thread_data>, wait_q=0x20001d88 <entropy_nrf5_data+24>, timeout=-1) at C:/Users/ma/zephyrproject/zephyr/kernel/sched.c:339
339	{
p wait_q
$2 = (_wait_q_t *) 0x20001d88 <entropy_nrf5_data+24>
p wait_q->waitq
$3 = {{head = 0x20000edc <recv_thread_data>, next = 0x20000edc <recv_thread_data>}, {tail = 0x20000edc <recv_thread_data>, prev = 0x20000edc <recv_thread_data>}}
p wait_q->waitq->next
$4 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next
$5 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next->next
$6 = (struct _dnode *) 0x20000edc <recv_thread_data>
p wait_q->waitq->next->next->next
$8 = (struct _dnode *) 0x20000edc <recv_thread_data>

xiaoliang314 (Author)

@pirast Yes, they may be caused by the same bug.

magilus commented Feb 15, 2019

Hi @xiaoliang314, thanks! I have seen some commits in master that may tackle the issue. However, I still seem to encounter it but am not able to reproduce it right now with a debugger attached (trying over the weekend). In case you have time: Could you cross-check? You seem to have a more appropriate test setup at hand.

xiaoliang314 (Author) commented Feb 18, 2019

@pirast @andyross I have updated the code to the latest version. It seems harder to reproduce now, but the problem is still there. I stress-tested nearly 50,000 iterations and it still hangs. I didn't inspect the state with gdb, but the symptom is the same: the uart console no longer produces any output.

xiaoliang314 (Author)

@pirast I modified the sample code so that it could be controlled by the host computer. I send the next packet as soon as the previous one has been sent successfully. My uart baud rate is 1M.

magilus commented Feb 21, 2019

I was able to reproduce it on latest master (c2d5e7b). This is the backtrace I could obtain with CONFIG_DEBUG=y.

After increasing the log buffer, I also get an exception message:

@andyross Is there any chance that you can come up with a solution? Is there any more information I can provide so that you can tackle this?

***** BUS FAULT *****
  Imprecise data bus error
***** Hardware exception *****
Current thread ID = 0x20000a28
Faulting instruction address = 0x2b43c
Fatal fault in thread 0x20000a28! Aborting.
#0  _priq_dumb_add (thread=0x20000ef8 <adv_thread_data>, pq=0x20003460 <_kernel+40>) at /home/psmeshloc/zephyr/kernel/sched.c:609
        t = <optimized out>
        t = <optimized out>
#1  _add_thread_to_ready_q (thread=thread@entry=0x20000ef8 <adv_thread_data>) at /home/psmeshloc/zephyr/kernel/sched.c:334
        __i = <optimized out>
        __key = {key = 0}
#2  0x0002b644 in _ready_thread (thread=0x20000ef8 <adv_thread_data>) at /home/psmeshloc/zephyr/kernel/include/ksched.h:233
No locals.
#3  z_thread_timeout (to=0x20000f10 <adv_thread_data+24>) at /home/psmeshloc/zephyr/kernel/sched.c:427
        th = 0x20000ef8 <adv_thread_data>
#4  0x0002c388 in z_clock_announce (ticks=<optimized out>) at /home/psmeshloc/zephyr/kernel/timeout.c:189
        t = 0x20000f10 <adv_thread_data+24>
        dt = <optimized out>
        key = {key = 0}
#5  0x00008bb4 in rtc1_nrf_isr (arg=<optimized out>) at /home/psmeshloc/zephyr/drivers/timer/nrf_rtc_timer.c:86
        key = <optimized out>
        t = <optimized out>
        dticks = <optimized out>
        next = <optimized out>
#6  0x00009462 in _isr_wrapper () at /home/psmeshloc/zephyr/arch/arm/core/isr_wrapper.S:120
No locals.
#7  <signal handler called>
No symbol table info available.
#8  0x6c5c7d88 in ?? ()
No symbol table info available.
#9  0x4cc01f80 in ?? ()
No symbol table info available.
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

jhedberg (Member) commented Mar 7, 2019

I don't see how the scheduling corruption is Bluetooth related, so I'm removing myself from the reviewers - at the same time I don't know who the right person would be, so this will be unassigned for a while.

jhedberg removed their assignment Mar 7, 2019
laperie (Collaborator) commented Mar 7, 2019

@andyross, assigning to you based on #12726 (comment): please ask the reporter to re-verify once the fixes you are working on are merged.

andyross (Contributor) commented Mar 7, 2019

Is this still reproducible? Note that there have been several fixes to scheduler and locking code since the HEAD referenced above (dff6b71 seems most likely) which could produce symptoms exactly like that. Would be good to check again. Or maybe it was a single spurious failure? Has it happened since?

xiaoliang314 (Author)

@andyross I will update to the latest version and test. I have set aside at least 12 hours of test time to see whether the problem still occurs.

bdrlamb commented Mar 9, 2019

I have run into this same infinite loop recently. After doing some investigation, I believe in my case it is caused by k_spin_lock not masking all interrupts (Arch=STM32L1). It appears that both higher-priority interrupts (31 is what I tested with) and IRQ_ZERO_LATENCY interrupts do not get masked.

That can cause issues if an ISR interrupts _add_timeout or _remove_timeout, and the ISR itself calls k_poll_signal_raise or other kernel functions which call _add_timeout or _remove_timeout. In my case, _add_timeout was being interrupted, further calls to _add_timeout/_remove_timeout were made from the ISR, and this ultimately broke the linked list.
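As an illustration of how such interleaving can wreck the list (simplified sketch, not the actual _add_timeout code):

/* Sketch only.  A doubly linked insert is several pointer writes; if an
 * unmasked ISR fires in the middle and inserts into the same list, the
 * two half-finished updates leave the forward and backward links
 * inconsistent, and later operations on such a list can then produce
 * the kind of self-referencing next pointers seen in the gdb dumps above.
 */
struct node { struct node *next, *prev; };

static void insert_after(struct node *pos, struct node *n)
{
	n->next = pos->next;        /* (1) */
	n->prev = pos;              /* (2) */
	/* If an ISR inserts another node after pos right here, pos->next
	 * changes and no longer matches the value captured in (1).
	 */
	pos->next->prev = n;        /* (3) patches whatever pos->next is now */
	pos->next = n;              /* (4) the ISR's node is silently dropped
				     *     from the forward chain
				     */
}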

I'm not sure whether this counts as a bug, but I think it might be worthwhile to have k_spin_lock assert if it catches itself being re-entered. Or at least have the kernel timer code locally record whether the list is locked and bugcheck if it finds it has somehow become re-entrant - this was difficult to track down given how hard it is to reproduce. A conceptual sketch follows.
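A conceptual sketch of such a re-entrancy check, using made-up demo_* names (this is not the real k_spinlock implementation; the CONFIG_ASSERT-based validation that actually exists in k_spin_lock is mentioned in a later comment):

/* Conceptual sketch only.  The lock records whether it is already held
 * and asserts on a nested acquire or on an unlock of a free lock.
 */
#include <zephyr.h>

struct demo_spinlock {
	unsigned int held;               /* 0 = free, 1 = held (UP case) */
};

static inline unsigned int demo_lock(struct demo_spinlock *l)
{
	unsigned int key = irq_lock();

	__ASSERT(l->held == 0U, "recursive spinlock acquisition");
	l->held = 1U;
	return key;
}

static inline void demo_unlock(struct demo_spinlock *l, unsigned int key)
{
	__ASSERT(l->held == 1U, "unlock of a lock that is not held");
	l->held = 0U;
	irq_unlock(key);
}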

bdrlamb commented Mar 9, 2019

One other option may be to have a number of the kernel functions assert if they're ever called from an NMI.

xiaoliang314 (Author)

@bdrlamb Are you using the latest version?

andyross (Contributor)

@bdrlamb: The rule is that _arch_irq_lock() (the underlying primitive used by the spinlocks) must mask any interrupt used to inspect or modify Zephyr kernel state. Some levels are allowed to sit higher than that, but they have to be for driver-specific purposes and may only touch things that Zephyr never looks at. Some of the radio code on nRF5x works this way, as far as I know. The ZERO_LATENCY feature is deliberately placed at that priority.

And FWIW: when CONFIG_ASSERT is true, k_spin_lock() does indeed include a validation layer that will catch reentrant locking, and releases of unlocked locks.

andyross (Contributor)

I'm going to close this one. I'm really quite certain this is the issue resolved with commit dff6b71. Please reopen if the symptom reappears.

bdrlamb commented Mar 28, 2019

@xiaoliang314 I wasn't using the absolute latest version, but it definitely included commit dff6b71. The issue was my fault (calling kernel functions in NMIs) and I have not encountered it since.
