unplugging usb remote caused kernel oops and system freeze #30

Closed
s7mx1 opened this Issue Jun 2, 2012 · 5 comments

Comments

Projects
None yet
3 participants

s7mx1 commented Jun 2, 2012

Was using (lirc) usb remote with my pi. Everything works fine but it will kernel opps if I

  1. unplug the usb remote when there is a lirc client connected
  2. any lirc client disconnects from lircd

If I unload the relevant kernel module first however the kernel oops does not happen. When kernel oops does happen the screen, network and everything seems to freeze and the serial console (tail -f /var/log/kernel.log) was flooded was error messages and I was not able to exit the tail by holding ctrl+c. I had to power cycle it get it back.

Here is the kernel.log message when I unplugged the usb remote

http://db.tt/11UZWZJW

kernel messages when I stop lircd with active lirc client

http://db.tt/V26UHG2e

kernel messages when lirc client disconnects from lircd
http://db.tt/PNR9P1Jy

Collaborator

popcornmix commented Jun 2, 2012

Are you running with latest kernel? (e.g. by using Hexxeh's firmware updater)

s7mx1 commented Jun 2, 2012

I am using a fairly rent kernel and firmware (1 or 2 days old)

This is an old issue. If you test it with other linux system it's likely it will also happen.. it's not rpi related, my guess is that it's mceusb module fault

s7mx1 commented Jul 10, 2012

Just built the latest kernel from source and looks like this dwc_otg patch fixed the issue for me.
68b4a99

s7mx1 commented Jul 10, 2012

I am happy to report that two of my remotes (xbox1 dvd remote and one MCE remote) no longer causing the system to freeze when I unplug them. This is with the kernel built from latest git tree.

@s7mx1 s7mx1 closed this Jul 10, 2012

popcornmix pushed a commit that referenced this issue Oct 13, 2012

x86: Remove some noise from boot log when starting cpus
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>

popcornmix pushed a commit that referenced this issue Oct 13, 2012

ftrace/x86: Do not change stacks in DEBUG when calling lockdep
When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
will call into the lockdep code. The lockdep code can call lots of
functions that may be traced by ftrace. When ftrace is updating its
code and hits a breakpoint, the breakpoint handler will call into
lockdep. If lockdep happens to call a function that also has a breakpoint
attached, it will jump back into the breakpoint handler resetting
the stack to the debug stack and corrupt the contents currently on
that stack.

The 'do_sym' call that calls do_int3() is protected by modifying the
IST table to point to a different location if another breakpoint is
hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
a breakpoint is hit from those, the stack will get corrupted, and
the kernel will crash:

[ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
[ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
[ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
[ 1013.310600] CPU 2
[ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1013.401848]
[ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
[ 1013.437943] RIP: 8eb8:[<ffff88014630a000>]  [<ffff88014630a000>] 0xffff880146309fff
[ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408  EFLAGS: 00010046
[ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
[ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
[ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
[ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
[ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
[ 1013.586108] FS:  0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1013.609458] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1013.717028] Stack:
[ 1013.724131]  ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
[ 1013.745918]  cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
[ 1013.767870]  ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
[ 1013.790021] Call Trace:
[ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
[ 1013.861443] RIP  [<ffff88014630a000>] 0xffff880146309fff
[ 1013.884466]  RSP <ffff88014780f408>
[ 1013.901507] CR2: 0000000000000002

The solution was to reuse the NMI functions that change the IDT table to make the debug
stack keep its current stack (in kernel mode) when hitting a breakpoint:

  call debug_stack_set_zero
  TRACE_IRQS_ON
  call debug_stack_reset

If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
and not crash the box.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ghaskins added a commit to ghaskins/raspberrypi-rt that referenced this issue Feb 20, 2013

Subject: rcu: Frob softirq test
With RT_FULL we get the below wreckage:

[  126.060484] =======================================================
[  126.060486] [ INFO: possible circular locking dependency detected ]
[  126.060489] 3.0.1-rt10+ #30
[  126.060490] -------------------------------------------------------
[  126.060492] irq/24-eth0/1235 is trying to acquire lock:
[  126.060495]  (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060503]
[  126.060504] but task is already holding lock:
[  126.060506]  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060511]
[  126.060511] which lock already depends on the new lock.
[  126.060513]
[  126.060514]
[  126.060514] the existing dependency chain (in reverse order) is:
[  126.060516]
[  126.060516] -> #1 (&p->pi_lock){-...-.}:
[  126.060519]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060524]        [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85
[  126.060527]        [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f
[  126.060531]        [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a
[  126.060534]        [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f
[  126.060537]        [<ffffffff810d9020>] rcu_boost+0xad/0xde
[  126.060541]        [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b
[  126.060544]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060547]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060551]
[  126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}:
[  126.060555]        [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060558]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060561]        [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060564]        [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060566]        [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060569]        [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060573]        [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060576]        [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060580]        [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060583]        [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060585]        [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060590]        [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060593]        [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060597]        [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060600]        [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060603]        [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060606]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060608]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060611]
[  126.060612] other info that might help us debug this:
[  126.060614]
[  126.060615]  Possible unsafe locking scenario:
[  126.060616]
[  126.060617]        CPU0                    CPU1
[  126.060619]        ----                    ----
[  126.060620]   lock(&p->pi_lock);
[  126.060623]                                lock(&(lock)->wait_lock);
[  126.060625]                                lock(&p->pi_lock);
[  126.060627]   lock(&(lock)->wait_lock);
[  126.060629]
[  126.060629]  *** DEADLOCK ***
[  126.060630]
[  126.060632] 1 lock held by irq/24-eth0/1235:
[  126.060633]  #0:  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060638]
[  126.060638] stack backtrace:
[  126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30
[  126.060643] Call Trace:
[  126.060644]  <IRQ>  [<ffffffff810acbde>] print_circular_bug+0x289/0x29a
[  126.060651]  [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060655]  [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99
[  126.060658]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060661]  [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060664]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060668]  [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060671]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060674]  [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c
[  126.060677]  [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060680]  [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4
[  126.060683]  [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060687]  [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060690]  [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060693]  [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060696]  [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5
[  126.060701]  [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90
[  126.060704]  [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060708]  [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21
[  126.060711]  [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060715]  [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060718]  [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060721]  [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060724]  [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060726]  <EOI>  [<ffffffff81072855>] ? migrate_disable+0x75/0x12d
[  126.060733]  [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f
[  126.060736]  [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f
[  126.060739]  [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060742]  [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59
[  126.060745]  [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060748]  [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a
[  126.060751]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060754]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060757]  [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060761]  [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060764]  [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a
[  126.060768]  [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe
[  126.060771]  [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c
[  126.060774]  [<ffffffff81509b10>] ? gs_change+0xb/0xb

Because irq_exit() does:

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

	...
}

Which triggers a wakeup, which uses RCU, now if the interrupted task has
t->rcu_read_unlock_special set, the rcu usage from the wakeup will end
up in rcu_read_unlock_special(). rcu_read_unlock_special() will test
for in_irq(), which will fail as we just decremented preempt_count
with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for
PREEMPT_RT_FULL reads:

int in_serving_softirq(void)
{
	int res;

	preempt_disable();
	res = __get_cpu_var(local_softirq_runner) == current;
	preempt_enable();
	return res;
}

Which will thus also fail, resulting in the above wreckage.

The 'somewhat' ugly solution is to open-code the preempt_count() test
in rcu_read_unlock_special().

Also, we're not at all sure how ->rcu_read_unlock_special gets set
here... so this is very likely a bandaid and more thought is required.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

popcornmix pushed a commit that referenced this issue Sep 4, 2013

watchdog: delete mpcore_wdt driver
Interrupt request doesn't use the right API: The TWD watchdog uses a per-cpu
interrupt (usually interrupt #30), and the GIC configuration should flag it as
such. With this setup, request_irq() should fail, and the right API is
request_percpu_irq(), together with enable_percpu_irq()/disable_percpu_irq().

Nothing ensures the userspace ioctl() will end-up kicking the watchdog on the
right CPU.

There are no users of this driver since a long time and it makes more sense to
get rid of it as nobody is looking to fix it.

In case somebody wakes up after this has been removed and needs it, please
revert this driver and pick these updates (These were never pushed to mainline):

http://comments.gmane.org/gmane.linux.ports.arm.kernel/245998

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@iguana.be>

popcornmix pushed a commit that referenced this issue Jan 10, 2015

audit: use supplied gfp_mask from audit_buffer in kauditd_send_multic…
…ast_skb

commit 54dc77d upstream.

Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
audit_log_end(), which can come from any context (aka even a sleeping context)
GFP_KERNEL can't be used.  Since the audit_buffer knows what context it should
use, pass that down and use that.

See: https://lkml.org/lkml/2014/12/16/542

BUG: sleeping function called from invalid context at mm/slab.c:2849
in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
2 locks held by sulogin/885:
  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b
  #1:  (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b
CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
  ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
  ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
  0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
Call Trace:
  [<ffffffff916ba529>] dump_stack+0x50/0xa8
  [<ffffffff91063185>] ___might_sleep+0x1b6/0x1be
  [<ffffffff910632a6>] __might_sleep+0x119/0x128
  [<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
  [<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9
  [<ffffffff914e148d>] __alloc_skb+0x42/0x1a3
  [<ffffffff914e2b62>] skb_copy+0x3e/0xa3
  [<ffffffff910c263e>] audit_log_end+0x83/0x100
  [<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103
  [<ffffffff91252a73>] common_lsm_audit+0x441/0x450
  [<ffffffff9123c163>] slow_avc_audit+0x63/0x67
  [<ffffffff9123c42c>] avc_has_perm+0xca/0xe3
  [<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65
  [<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b
  [<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10
  [<ffffffff911515e6>] install_exec_creds+0xe/0x79
  [<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7
  [<ffffffff9115198e>] search_binary_handler+0x81/0x18c
  [<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7
  [<ffffffff91153669>] do_execve+0x1f/0x21
  [<ffffffff91153967>] SyS_execve+0x25/0x29
  [<ffffffff916c61a9>] stub_execve+0x69/0xa0

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

anholt referenced this issue in anholt/linux Apr 21, 2015

thermal: rcar: Fix race condition between init and interrupt
As soon as the interrupt has been enabled by devm_request_irq(), the
interrupt routine may be called, depending on the current status of the
hardware.

However, at that point rcar_thermal_common hasn't been initialized
complely yet. E.g. rcar_thermal_common.base is still NULL, causing a
NULL pointer dereference:

    Unable to handle kernel NULL pointer dereference at virtual address 0000000c
    pgd = c0004000
    [0000000c] *pgd=00000000
    Internal error: Oops: 5 [#1] SMP ARM
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-rc7-ape6evm-04564-gb6e46cb7cbe82389 #30
    Hardware name: Generic R8A73A4 (Flattened Device Tree)
    task: ee8953c0 ti: ee896000 task.ti: ee896000
    PC is at rcar_thermal_irq+0x1c/0xf0
    LR is at _raw_spin_lock_irqsave+0x48/0x54

Postpone the call to devm_request_irq() until all initialization has
been done to fix this.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Eduardo Valentin <edubezval@gmail.com>

davet321 pushed a commit to davet321/rpi-linux that referenced this issue Aug 17, 2015

mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
commit ecf5fc6 upstream.

Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Aug 20, 2015

mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

gumbit added a commit to gumbit/linux-rt-rpi that referenced this issue Dec 28, 2015

audit: use supplied gfp_mask from audit_buffer in kauditd_send_multic…
…ast_skb

commit 54dc77d upstream.

Eric Paris explains: Since kauditd_send_multicast_skb() gets called in
audit_log_end(), which can come from any context (aka even a sleeping context)
GFP_KERNEL can't be used.  Since the audit_buffer knows what context it should
use, pass that down and use that.

See: https://lkml.org/lkml/2014/12/16/542

BUG: sleeping function called from invalid context at mm/slab.c:2849
in_atomic(): 1, irqs_disabled(): 0, pid: 885, name: sulogin
2 locks held by sulogin/885:
  #0:  (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff91152e30>] prepare_bprm_creds+0x28/0x8b
  #1:  (tty_files_lock){+.+.+.}, at: [<ffffffff9123e787>] selinux_bprm_committing_creds+0x55/0x22b
CPU: 1 PID: 885 Comm: sulogin Not tainted 3.18.0-next-20141216 #30
Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A15 06/20/2014
  ffff880223744f10 ffff88022410f9b8 ffffffff916ba529 0000000000000375
  ffff880223744f10 ffff88022410f9e8 ffffffff91063185 0000000000000006
  0000000000000000 0000000000000000 0000000000000000 ffff88022410fa38
Call Trace:
  [<ffffffff916ba529>] dump_stack+0x50/0xa8
  [<ffffffff91063185>] ___might_sleep+0x1b6/0x1be
  [<ffffffff910632a6>] __might_sleep+0x119/0x128
  [<ffffffff91140720>] cache_alloc_debugcheck_before.isra.45+0x1d/0x1f
  [<ffffffff91141d81>] kmem_cache_alloc+0x43/0x1c9
  [<ffffffff914e148d>] __alloc_skb+0x42/0x1a3
  [<ffffffff914e2b62>] skb_copy+0x3e/0xa3
  [<ffffffff910c263e>] audit_log_end+0x83/0x100
  [<ffffffff9123b8d3>] ? avc_audit_pre_callback+0x103/0x103
  [<ffffffff91252a73>] common_lsm_audit+0x441/0x450
  [<ffffffff9123c163>] slow_avc_audit+0x63/0x67
  [<ffffffff9123c42c>] avc_has_perm+0xca/0xe3
  [<ffffffff9123dc2d>] inode_has_perm+0x5a/0x65
  [<ffffffff9123e7ca>] selinux_bprm_committing_creds+0x98/0x22b
  [<ffffffff91239e64>] security_bprm_committing_creds+0xe/0x10
  [<ffffffff911515e6>] install_exec_creds+0xe/0x79
  [<ffffffff911974cf>] load_elf_binary+0xe36/0x10d7
  [<ffffffff9115198e>] search_binary_handler+0x81/0x18c
  [<ffffffff91153376>] do_execveat_common.isra.31+0x4e3/0x7b7
  [<ffffffff91153669>] do_execve+0x1f/0x21
  [<ffffffff91153967>] SyS_execve+0x25/0x29
  [<ffffffff916c61a9>] stub_execve+0x69/0xa0

Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Paul Moore <pmoore@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

giraldeau pushed a commit to giraldeau/linux that referenced this issue Apr 12, 2016

rcu: Frob softirq test
With RT_FULL we get the below wreckage:

[  126.060484] =======================================================
[  126.060486] [ INFO: possible circular locking dependency detected ]
[  126.060489] 3.0.1-rt10+ #30
[  126.060490] -------------------------------------------------------
[  126.060492] irq/24-eth0/1235 is trying to acquire lock:
[  126.060495]  (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060503] 
[  126.060504] but task is already holding lock:
[  126.060506]  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060511] 
[  126.060511] which lock already depends on the new lock.
[  126.060513] 
[  126.060514] 
[  126.060514] the existing dependency chain (in reverse order) is:
[  126.060516] 
[  126.060516] -> #1 (&p->pi_lock){-...-.}:
[  126.060519]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060524]        [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85
[  126.060527]        [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f
[  126.060531]        [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a
[  126.060534]        [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f
[  126.060537]        [<ffffffff810d9020>] rcu_boost+0xad/0xde
[  126.060541]        [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b
[  126.060544]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060547]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060551] 
[  126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}:
[  126.060555]        [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060558]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060561]        [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060564]        [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060566]        [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060569]        [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060573]        [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060576]        [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060580]        [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060583]        [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060585]        [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060590]        [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060593]        [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060597]        [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060600]        [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060603]        [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060606]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060608]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060611] 
[  126.060612] other info that might help us debug this:
[  126.060614] 
[  126.060615]  Possible unsafe locking scenario:
[  126.060616] 
[  126.060617]        CPU0                    CPU1
[  126.060619]        ----                    ----
[  126.060620]   lock(&p->pi_lock);
[  126.060623]                                lock(&(lock)->wait_lock);
[  126.060625]                                lock(&p->pi_lock);
[  126.060627]   lock(&(lock)->wait_lock);
[  126.060629] 
[  126.060629]  *** DEADLOCK ***
[  126.060630] 
[  126.060632] 1 lock held by irq/24-eth0/1235:
[  126.060633]  #0:  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060638] 
[  126.060638] stack backtrace:
[  126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30
[  126.060643] Call Trace:
[  126.060644]  <IRQ>  [<ffffffff810acbde>] print_circular_bug+0x289/0x29a
[  126.060651]  [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060655]  [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99
[  126.060658]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060661]  [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060664]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060668]  [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060671]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060674]  [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c
[  126.060677]  [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060680]  [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4
[  126.060683]  [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060687]  [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060690]  [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060693]  [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060696]  [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5
[  126.060701]  [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90
[  126.060704]  [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060708]  [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21
[  126.060711]  [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060715]  [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060718]  [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060721]  [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060724]  [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060726]  <EOI>  [<ffffffff81072855>] ? migrate_disable+0x75/0x12d
[  126.060733]  [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f
[  126.060736]  [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f
[  126.060739]  [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060742]  [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59
[  126.060745]  [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060748]  [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a
[  126.060751]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060754]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060757]  [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060761]  [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060764]  [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a
[  126.060768]  [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe
[  126.060771]  [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c
[  126.060774]  [<ffffffff81509b10>] ? gs_change+0xb/0xb

Because irq_exit() does:

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

	...
}

Which triggers a wakeup, which uses RCU, now if the interrupted task has
t->rcu_read_unlock_special set, the rcu usage from the wakeup will end
up in rcu_read_unlock_special(). rcu_read_unlock_special() will test
for in_irq(), which will fail as we just decremented preempt_count
with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for
PREEMPT_RT_FULL reads:

int in_serving_softirq(void)
{
	int res;

	preempt_disable();
	res = __get_cpu_var(local_softirq_runner) == current;
	preempt_enable();
	return res;
}

Which will thus also fail, resulting in the above wreckage.

The 'somewhat' ugly solution is to open-code the preempt_count() test
in rcu_read_unlock_special().

Also, we're not at all sure how ->rcu_read_unlock_special gets set
here... so this is very likely a bandaid and more thought is required.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

davet321 pushed a commit to davet321/rpi-linux that referenced this issue May 16, 2016

arm64: bpf: jit JMP_JSET_{X,K}
Original implementation commit e54bcde ("arm64: eBPF JIT compiler")
had the relevant code paths, but due to an oversight always fail jiting.

As a result, we had been falling back to BPF interpreter whenever a BPF
program has JMP_JSET_{X,K} instructions.

With this fix, we confirm that the corresponding tests in lib/test_bpf
continue to pass, and also jited.

...
[    2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
[    2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
[    2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
...
[    3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 PASS
[    3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0xffffffff) return 1 jited:1 98 PASS
[    3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 PASS
[    3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0xffffffff) return 1 jited:1 89 PASS
...

Fixes: e54bcde ("arm64: eBPF JIT compiler")
Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Acked-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlyle added a commit to d-ronin/linux that referenced this issue Aug 21, 2016

rcu: Frob softirq test
With RT_FULL we get the below wreckage:

[  126.060484] =======================================================
[  126.060486] [ INFO: possible circular locking dependency detected ]
[  126.060489] 3.0.1-rt10+ #30
[  126.060490] -------------------------------------------------------
[  126.060492] irq/24-eth0/1235 is trying to acquire lock:
[  126.060495]  (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060503]
[  126.060504] but task is already holding lock:
[  126.060506]  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060511]
[  126.060511] which lock already depends on the new lock.
[  126.060513]
[  126.060514]
[  126.060514] the existing dependency chain (in reverse order) is:
[  126.060516]
[  126.060516] -> #1 (&p->pi_lock){-...-.}:
[  126.060519]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060524]        [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85
[  126.060527]        [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f
[  126.060531]        [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a
[  126.060534]        [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f
[  126.060537]        [<ffffffff810d9020>] rcu_boost+0xad/0xde
[  126.060541]        [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b
[  126.060544]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060547]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060551]
[  126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}:
[  126.060555]        [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060558]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060561]        [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060564]        [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060566]        [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060569]        [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060573]        [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060576]        [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060580]        [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060583]        [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060585]        [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060590]        [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060593]        [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060597]        [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060600]        [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060603]        [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060606]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060608]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060611]
[  126.060612] other info that might help us debug this:
[  126.060614]
[  126.060615]  Possible unsafe locking scenario:
[  126.060616]
[  126.060617]        CPU0                    CPU1
[  126.060619]        ----                    ----
[  126.060620]   lock(&p->pi_lock);
[  126.060623]                                lock(&(lock)->wait_lock);
[  126.060625]                                lock(&p->pi_lock);
[  126.060627]   lock(&(lock)->wait_lock);
[  126.060629]
[  126.060629]  *** DEADLOCK ***
[  126.060630]
[  126.060632] 1 lock held by irq/24-eth0/1235:
[  126.060633]  #0:  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060638]
[  126.060638] stack backtrace:
[  126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30
[  126.060643] Call Trace:
[  126.060644]  <IRQ>  [<ffffffff810acbde>] print_circular_bug+0x289/0x29a
[  126.060651]  [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060655]  [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99
[  126.060658]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060661]  [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060664]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060668]  [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060671]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060674]  [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c
[  126.060677]  [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060680]  [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4
[  126.060683]  [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060687]  [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060690]  [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060693]  [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060696]  [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5
[  126.060701]  [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90
[  126.060704]  [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060708]  [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21
[  126.060711]  [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060715]  [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060718]  [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060721]  [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060724]  [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060726]  <EOI>  [<ffffffff81072855>] ? migrate_disable+0x75/0x12d
[  126.060733]  [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f
[  126.060736]  [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f
[  126.060739]  [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060742]  [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59
[  126.060745]  [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060748]  [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a
[  126.060751]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060754]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060757]  [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060761]  [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060764]  [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a
[  126.060768]  [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe
[  126.060771]  [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c
[  126.060774]  [<ffffffff81509b10>] ? gs_change+0xb/0xb

Because irq_exit() does:

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

	...
}

Which triggers a wakeup, which uses RCU, now if the interrupted task has
t->rcu_read_unlock_special set, the rcu usage from the wakeup will end
up in rcu_read_unlock_special(). rcu_read_unlock_special() will test
for in_irq(), which will fail as we just decremented preempt_count
with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for
PREEMPT_RT_FULL reads:

int in_serving_softirq(void)
{
	int res;

	preempt_disable();
	res = __get_cpu_var(local_softirq_runner) == current;
	preempt_enable();
	return res;
}

Which will thus also fail, resulting in the above wreckage.

The 'somewhat' ugly solution is to open-code the preempt_count() test
in rcu_read_unlock_special().

Also, we're not at all sure how ->rcu_read_unlock_special gets set
here... so this is very likely a bandaid and more thought is required.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>

popcornmix pushed a commit that referenced this issue Jan 30, 2017

userfaultfd: fix SIGBUS resulting from false rwsem wakeups
With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.

This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
  bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

    save_stack_trace+0x2b/0x50
    try_to_wake_up+0x2a6/0x580
    wake_up_q+0x32/0x70
    rwsem_wake+0xe0/0x120
    call_rwsem_wake+0x1b/0x30
    up_write+0x3b/0x40
    vm_mmap_pgoff+0x9c/0xc0
    SyS_mmap_pgoff+0x1a9/0x240
    SyS_mmap+0x22/0x30
    entry_SYSCALL_64_fastpath+0x1f/0xbd
    0xffffffffffffffff
    FAULT_FLAG_ALLOW_RETRY missing 70
  CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
    dump_stack+0xb8/0x112
    handle_userfault+0x572/0x650
    handle_mm_fault+0x12cb/0x1520
    __do_page_fault+0x175/0x500
    trace_do_page_fault+0x61/0x270
    do_async_page_fault+0x19/0x90
    async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).

This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.

False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).

Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

popcornmix pushed a commit that referenced this issue Feb 6, 2017

firmware: fix NULL pointer dereference in __fw_load_abort()
Since commit 5d47ec0 ("firmware: Correct handling of
fw_state_wait() return value") fw_load_abort() could be called twice and
lead us to a kernel crash. This happens only when the firmware fallback
mechanism (regular or custom) is used. The fallback mechanism exposes a
sysfs interface for userspace to upload a file and notify the kernel
when the file is loaded and ready, or to cancel an upload by echo'ing -1
into on the loading file:

echo -n "-1" > /sys/$DEVPATH/loading

This will call fw_load_abort(). Some distributions actually have a udev
rule in place to *always* immediately cancel all firmware fallback
mechanism requests (Debian), they have:

  $ cat /lib/udev/rules.d/50-firmware.rules
  # stub for immediately telling the kernel that userspace firmware loading
  # failed; necessary to avoid long timeouts with CONFIG_FW_LOADER_USER_HELPER=y
  SUBSYSTEM=="firmware", ACTION=="add", ATTR{loading}="-1

Distributions with this udev rule would run into this crash only if the
fallback mechanism is used. Since most distributions disable by default
using the fallback mechanism (CONFIG_FW_LOADER_USER_HELPER_FALLBACK),
this would typicaly mean only 2 drivers which *require* the fallback
mechanism could typically incur a crash: drivers/firmware/dell_rbu.c and
the drivers/leds/leds-lp55xx-common.c driver. Distributions enabling
CONFIG_FW_LOADER_USER_HELPER_FALLBACK by default are obviously more
exposed to this crash.

The crash happens because after commit 5b02962 ("firmware: do not
use fw_lock for fw_state protection") and subsequent fix commit
5d47ec0 ("firmware: Correct handling of fw_state_wait() return
value") a race can happen between this cancelation and the firmware
fw_state_wait_timeout() being woken up after a state change with which
fw_load_abort() as that calls swake_up(). Upon error
fw_state_wait_timeout() will also again call fw_load_abort() and trigger
a null reference.

At first glance we could just fix this with a !buf check on
fw_load_abort() before accessing buf->fw_st, however there is a logical
issue in having a state machine used for the fallback mechanism and
preventing access from it once we abort as its inside the buf
(buf->fw_st).

The firmware_class.c code is setting the buf to NULL to annotate an
abort has occurred. Replace this mechanism by simply using the state
check instead. All the other code in place already uses similar checks
for aborting as well so no further changes are needed.

An oops can be reproduced with the new fw_fallback.sh fallback mechanism
cancellation test. Either cancelling the fallback mechanism or the
custom fallback mechanism triggers a crash.

mcgrof@piggy ~/linux-next/tools/testing/selftests/firmware
(git::20170111-fw-fixes)$ sudo ./fw_fallback.sh

./fw_fallback.sh: timeout works
./fw_fallback.sh: firmware comparison works
./fw_fallback.sh: fallback mechanism works

[ this then sits here when it is trying the cancellation test ]

Kernel log:

test_firmware: loading 'nope-test-firmware.bin'
misc test_firmware: Direct firmware load for nope-test-firmware.bin failed with error -2
misc test_firmware: Falling back to user helper
BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
IP: _request_firmware+0xa27/0xad0
PGD 0

Oops: 0000 [#1] SMP
Modules linked in: test_firmware(E) ... etc ...
CPU: 1 PID: 1396 Comm: fw_fallback.sh Tainted: G        W E   4.10.0-rc3-next-20170111+ #30
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
task: ffff9740b27f4340 task.stack: ffffbb15c0bc8000
RIP: 0010:_request_firmware+0xa27/0xad0
RSP: 0018:ffffbb15c0bcbd10 EFLAGS: 00010246
RAX: 00000000fffffffe RBX: ffff9740afe5aa80 RCX: 0000000000000000
RDX: ffff9740b27f4340 RSI: 0000000000000283 RDI: 0000000000000000
RBP: ffffbb15c0bcbd90 R08: ffffbb15c0bcbcd8 R09: 0000000000000000
R10: 0000000894a0d4b1 R11: 000000000000008c R12: ffffffffc0312480
R13: 0000000000000005 R14: ffff9740b1c32400 R15: 00000000000003e8
FS:  00007f8604422700(0000) GS:ffff9740bfc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000038 CR3: 000000012164c000 CR4: 00000000000006e0
Call Trace:
 request_firmware+0x37/0x50
 trigger_request_store+0x79/0xd0 [test_firmware]
 dev_attr_store+0x18/0x30
 sysfs_kf_write+0x37/0x40
 kernfs_fop_write+0x110/0x1a0
 __vfs_write+0x37/0x160
 ? _cond_resched+0x1a/0x50
 vfs_write+0xb5/0x1a0
 SyS_write+0x55/0xc0
 ? trace_do_page_fault+0x37/0xd0
 entry_SYSCALL_64_fastpath+0x1e/0xad
RIP: 0033:0x7f8603f49620
RSP: 002b:00007fff6287b788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055c307b110a0 RCX: 00007f8603f49620
RDX: 0000000000000016 RSI: 000055c3084d8a90 RDI: 0000000000000001
RBP: 0000000000000016 R08: 000000000000c0ff R09: 000055c3084d6336
R10: 000055c307b108b0 R11: 0000000000000246 R12: 000055c307b13c80
R13: 000055c3084d6320 R14: 0000000000000000 R15: 00007fff6287b950
Code: 9f 64 84 e8 9c 61 fe ff b8 f4 ff ff ff e9 6b f9 ff
ff 48 c7 c7 40 6b 8d 84 89 45 a8 e8 43 84 18 00 49 8b be 00 03 00 00 8b
45 a8 <83> 7f 38 02 74 08 e8 6e ec ff ff 8b 45 a8 49 c7 86 00 03 00 00
RIP: _request_firmware+0xa27/0xad0 RSP: ffffbb15c0bcbd10
CR2: 0000000000000038
---[ end trace 6d94ac339c133e6f ]---

Fixes: 5d47ec0 ("firmware: Correct handling of fw_state_wait() return value")
Reported-and-Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reported-and-Tested-by: Patrick Bruenn <p.bruenn@beckhoff.com>
Reported-by: Chris Wilson <chris@chris-wilson.co.uk>
CC: <stable@vger.kernel.org>    [3.10+]
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Jun 17, 2017

userfaultfd: fix SIGBUS resulting from false rwsem wakeups
[ Upstream commit 15a77c6 ]

With >=32 CPUs the userfaultfd selftest triggered a graceful but
unexpected SIGBUS because VM_FAULT_RETRY was returned by
handle_userfault() despite the UFFDIO_COPY wasn't completed.

This seems caused by rwsem waking the thread blocked in
handle_userfault() and we can't run up_read() before the wait_event
sequence is complete.

Keeping the wait_even sequence identical to the first one, would require
running userfaultfd_must_wait() again to know if the loop should be
repeated, and it would also require retaking the rwsem and revalidating
the whole vma status.

It seems simpler to wait the targeted wakeup so that if false wakeups
materialize we still wait for our specific wakeup event, unless of
course there are signals or the uffd was released.

Debug code collecting the stack trace of the wakeup showed this:

  $ ./userfaultfd 100 99999
  nr_pages: 25600, nr_pages_per_cpu: 800
  bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
  bounces: 99997, mode: rnd ver poll, Bus error (core dumped)

    save_stack_trace+0x2b/0x50
    try_to_wake_up+0x2a6/0x580
    wake_up_q+0x32/0x70
    rwsem_wake+0xe0/0x120
    call_rwsem_wake+0x1b/0x30
    up_write+0x3b/0x40
    vm_mmap_pgoff+0x9c/0xc0
    SyS_mmap_pgoff+0x1a9/0x240
    SyS_mmap+0x22/0x30
    entry_SYSCALL_64_fastpath+0x1f/0xbd
    0xffffffffffffffff
    FAULT_FLAG_ALLOW_RETRY missing 70
  CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
  Call Trace:
    dump_stack+0xb8/0x112
    handle_userfault+0x572/0x650
    handle_mm_fault+0x12cb/0x1520
    __do_page_fault+0x175/0x500
    trace_do_page_fault+0x61/0x270
    do_async_page_fault+0x19/0x90
    async_page_fault+0x25/0x30

This always happens when the main userfault selftest thread is running
clone() while glibc runs either mprotect or mmap (both taking mmap_sem
down_write()) to allocate the thread stack of the background threads,
while locking/userfault threads already run at full throttle and are
susceptible to false wakeups that may cause handle_userfault() to return
before than expected (which results in graceful SIGBUS at the next
attempt).

This was reproduced only with >=32 CPUs because the loop to start the
thread where clone() is too quick with fewer CPUs, while with 32 CPUs
there's already significant activity on ~32 locking and userfault
threads when the last background threads are started with clone().

This >=32 CPUs SMP race condition is likely reproducible only with the
selftest because of the much heavier userfault load it generates if
compared to real apps.

We'll have to allow "one more" VM_FAULT_RETRY for the WP support and a
patch floating around that provides it also hidden this problem but in
reality only is successfully at hiding the problem.

False wakeups could still happen again the second time
handle_userfault() is invoked, even if it's a so rare race condition
that getting false wakeups twice in a row is impossible to reproduce.
This full fix is needed for correctness, the only alternative would be
to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
of returning it infinite times to avoid the SIGBUS).

Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Jul 29, 2017

f2fs: use spin_{,un}lock_irq{save,restore}
commit d1aa245 upstream.

generic/361 reports below warning, this is because: once, there is
someone entering into critical region of sbi.cp_lock, if write_end_io.
f2fs_stop_checkpoint is invoked from an triggered IRQ, we will encounter
deadlock.

So this patch changes to use spin_{,un}lock_irq{save,restore} to create
critical region without IRQ enabled to avoid potential deadlock.

 irq event stamp: 83391573
 loop: Write error at byte offset 438729728, length 1024.
 hardirqs last  enabled at (83391573): [<c1809752>] restore_all+0xf/0x65
 hardirqs last disabled at (83391572): [<c1809eac>] reschedule_interrupt+0x30/0x3c
 loop: Write error at byte offset 438860288, length 1536.
 softirqs last  enabled at (83389244): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (83389237): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40
 loop: Write error at byte offset 438990848, length 2048.
 ================================
 WARNING: inconsistent lock state
 4.12.0-rc2+ #30 Tainted: G           O
 --------------------------------
 inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
 xfs_io/7959 [HC1[1]:SC0[0]:HE0:SE1] takes:
  (&(&sbi->cp_lock)->rlock){?.+...}, at: [<f96f96cc>] f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
 {HARDIRQ-ON-W} state was registered at:
   __lock_acquire+0x527/0x7b0
   lock_acquire+0xae/0x220
   _raw_spin_lock+0x42/0x50
   do_checkpoint+0x165/0x9e0 [f2fs]
   write_checkpoint+0x33f/0x740 [f2fs]
   __f2fs_sync_fs+0x92/0x1f0 [f2fs]
   f2fs_sync_fs+0x12/0x20 [f2fs]
   sync_filesystem+0x67/0x80
   generic_shutdown_super+0x27/0x100
   kill_block_super+0x22/0x50
   kill_f2fs_super+0x3a/0x40 [f2fs]
   deactivate_locked_super+0x3d/0x70
   deactivate_super+0x40/0x60
   cleanup_mnt+0x39/0x70
   __cleanup_mnt+0x10/0x20
   task_work_run+0x69/0x80
   exit_to_usermode_loop+0x57/0x85
   do_fast_syscall_32+0x18c/0x1b0
   entry_SYSENTER_32+0x4c/0x7b
 irq event stamp: 1957420
 hardirqs last  enabled at (1957419): [<c1808f37>] _raw_spin_unlock_irq+0x27/0x50
 hardirqs last disabled at (1957420): [<c1809f9c>] call_function_single_interrupt+0x30/0x3c
 softirqs last  enabled at (1953784): [<c180cc4e>] __do_softirq+0x1ae/0x476
 softirqs last disabled at (1953773): [<c101ca7c>] do_softirq_own_stack+0x2c/0x40

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&(&sbi->cp_lock)->rlock);
   <Interrupt>
     lock(&(&sbi->cp_lock)->rlock);

  *** DEADLOCK ***

 2 locks held by xfs_io/7959:
  #0:  (sb_writers#13){.+.+.+}, at: [<c11fd7ca>] vfs_write+0x16a/0x190
  #1:  (&sb->s_type->i_mutex_key#16){+.+.+.}, at: [<f96e33f5>] f2fs_file_write_iter+0x25/0x140 [f2fs]

 stack backtrace:
 CPU: 2 PID: 7959 Comm: xfs_io Tainted: G           O    4.12.0-rc2+ #30
 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
 Call Trace:
  dump_stack+0x5f/0x92
  print_usage_bug+0x1d3/0x1dd
  ? check_usage_backwards+0xe0/0xe0
  mark_lock+0x23d/0x280
  __lock_acquire+0x699/0x7b0
  ? __this_cpu_preempt_check+0xf/0x20
  ? trace_hardirqs_off_caller+0x91/0xe0
  lock_acquire+0xae/0x220
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  _raw_spin_lock+0x42/0x50
  ? f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_stop_checkpoint+0x1c/0x50 [f2fs]
  f2fs_write_end_io+0x147/0x150 [f2fs]
  bio_endio+0x7a/0x1e0
  blk_update_request+0xad/0x410
  blk_mq_end_request+0x16/0x60
  lo_complete_rq+0x3c/0x70
  __blk_mq_complete_request_remote+0x11/0x20
  flush_smp_call_function_queue+0x6d/0x120
  ? debug_smp_processor_id+0x12/0x20
  generic_smp_call_function_single_interrupt+0x12/0x30
  smp_call_function_single_interrupt+0x25/0x40
  call_function_single_interrupt+0x37/0x3c
 EIP: _raw_spin_unlock_irq+0x2d/0x50
 EFLAGS: 00000296 CPU: 2
 EAX: 00000001 EBX: d2ccc51c ECX: 00000001 EDX: c1aacebd
 ESI: 00000000 EDI: 00000000 EBP: c96c9d1c ESP: c96c9d18
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
  ? inherit_task_group.isra.98.part.99+0x6b/0xb0
  __add_to_page_cache_locked+0x1d4/0x290
  add_to_page_cache_lru+0x38/0xb0
  pagecache_get_page+0x8e/0x200
  f2fs_write_begin+0x96/0xf00 [f2fs]
  ? trace_hardirqs_on_caller+0xdd/0x1c0
  ? current_time+0x17/0x50
  ? trace_hardirqs_on+0xb/0x10
  generic_perform_write+0xa9/0x170
  __generic_file_write_iter+0x1a2/0x1f0
  ? f2fs_preallocate_blocks+0x137/0x160 [f2fs]
  f2fs_file_write_iter+0x6e/0x140 [f2fs]
  ? __lock_acquire+0x429/0x7b0
  __vfs_write+0xc1/0x140
  vfs_write+0x9b/0x190
  SyS_pwrite64+0x63/0xa0
  do_fast_syscall_32+0xa1/0x1b0
  entry_SYSENTER_32+0x4c/0x7b
 EIP: 0xb7786c61
 EFLAGS: 00000293 CPU: 2
 EAX: ffffffda EBX: 00000003 ECX: 08416000 EDX: 00001000
 ESI: 18b24000 EDI: 00000000 EBP: 00000003 ESP: bf9b36b0
  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b

Fixes: aaec2b1 ("f2fs: introduce cp_lock to protect updating of ckpt_flags")
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Jajcus added a commit to Jajcus/rpi-kernel-rt that referenced this issue Dec 20, 2017

rcu: Frob softirq test
With RT_FULL we get the below wreckage:

[  126.060484] =======================================================
[  126.060486] [ INFO: possible circular locking dependency detected ]
[  126.060489] 3.0.1-rt10+ #30
[  126.060490] -------------------------------------------------------
[  126.060492] irq/24-eth0/1235 is trying to acquire lock:
[  126.060495]  (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060503] 
[  126.060504] but task is already holding lock:
[  126.060506]  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060511] 
[  126.060511] which lock already depends on the new lock.
[  126.060513] 
[  126.060514] 
[  126.060514] the existing dependency chain (in reverse order) is:
[  126.060516] 
[  126.060516] -> #1 (&p->pi_lock){-...-.}:
[  126.060519]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060524]        [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85
[  126.060527]        [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f
[  126.060531]        [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a
[  126.060534]        [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f
[  126.060537]        [<ffffffff810d9020>] rcu_boost+0xad/0xde
[  126.060541]        [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b
[  126.060544]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060547]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060551] 
[  126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}:
[  126.060555]        [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060558]        [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060561]        [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060564]        [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060566]        [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060569]        [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060573]        [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060576]        [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060580]        [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060583]        [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060585]        [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060590]        [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060593]        [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060597]        [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060600]        [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060603]        [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060606]        [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060608]        [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060611] 
[  126.060612] other info that might help us debug this:
[  126.060614] 
[  126.060615]  Possible unsafe locking scenario:
[  126.060616] 
[  126.060617]        CPU0                    CPU1
[  126.060619]        ----                    ----
[  126.060620]   lock(&p->pi_lock);
[  126.060623]                                lock(&(lock)->wait_lock);
[  126.060625]                                lock(&p->pi_lock);
[  126.060627]   lock(&(lock)->wait_lock);
[  126.060629] 
[  126.060629]  *** DEADLOCK ***
[  126.060630] 
[  126.060632] 1 lock held by irq/24-eth0/1235:
[  126.060633]  #0:  (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429
[  126.060638] 
[  126.060638] stack backtrace:
[  126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30
[  126.060643] Call Trace:
[  126.060644]  <IRQ>  [<ffffffff810acbde>] print_circular_bug+0x289/0x29a
[  126.060651]  [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816
[  126.060655]  [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99
[  126.060658]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060661]  [<ffffffff810afe9e>] lock_acquire+0x145/0x18a
[  126.060664]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060668]  [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73
[  126.060671]  [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55
[  126.060674]  [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c
[  126.060677]  [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55
[  126.060680]  [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4
[  126.060683]  [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29
[  126.060687]  [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4
[  126.060690]  [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89
[  126.060693]  [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5
[  126.060696]  [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5
[  126.060701]  [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90
[  126.060704]  [<ffffffff8107511c>] try_to_wake_up+0x175/0x429
[  126.060708]  [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21
[  126.060711]  [<ffffffff81075425>] wake_up_process+0x15/0x17
[  126.060715]  [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26
[  126.060718]  [<ffffffff81081df9>] irq_exit+0x49/0x55
[  126.060721]  [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98
[  126.060724]  [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20
[  126.060726]  <EOI>  [<ffffffff81072855>] ? migrate_disable+0x75/0x12d
[  126.060733]  [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f
[  126.060736]  [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f
[  126.060739]  [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44
[  126.060742]  [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59
[  126.060745]  [<ffffffff810d582c>] irq_thread+0xde/0x1af
[  126.060748]  [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a
[  126.060751]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060754]  [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1
[  126.060757]  [<ffffffff8109a760>] kthread+0x99/0xa1
[  126.060761]  [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10
[  126.060764]  [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a
[  126.060768]  [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe
[  126.060771]  [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c
[  126.060774]  [<ffffffff81509b10>] ? gs_change+0xb/0xb

Because irq_exit() does:

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

	...
}

Which triggers a wakeup, which uses RCU, now if the interrupted task has
t->rcu_read_unlock_special set, the rcu usage from the wakeup will end
up in rcu_read_unlock_special(). rcu_read_unlock_special() will test
for in_irq(), which will fail as we just decremented preempt_count
with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for
PREEMPT_RT_FULL reads:

int in_serving_softirq(void)
{
	int res;

	preempt_disable();
	res = __get_cpu_var(local_softirq_runner) == current;
	preempt_enable();
	return res;
}

Which will thus also fail, resulting in the above wreckage.

The 'somewhat' ugly solution is to open-code the preempt_count() test
in rcu_read_unlock_special().

Also, we're not at all sure how ->rcu_read_unlock_special gets set
here... so this is very likely a bandaid and more thought is required.

Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment