kexec doesn't quite work. #27

Open
XECDesign opened this Issue May 20, 2012 · 16 comments

Comments

Projects
None yet
8 participants

There are some issues with booting kernels using kexec (it doesn't work). It would be great if someone could look into it. I know there's lots of interest in being able to select kernels and rootfs without modifying files and kexec is probably the simplest, so it's definitely worth fixing, I think.

Contributor

lp0 commented May 20, 2012

The power management code is probably blocking when it tries to power up the devices a second time.

diff --git a/arch/arm/mach-bcm2708/power.c b/arch/arm/mach-bcm2708/power.c
index a4139fc..7e418a1 100644
--- a/arch/arm/mach-bcm2708/power.c
+++ b/arch/arm/mach-bcm2708/power.c
@@ -163,6 +163,7 @@ static int __init bcm_power_init(void)
    int i;

    printk(KERN_INFO "bcm_power: Broadcom power driver\n");
+   bcm_mailbox_write(MBOX_CHAN_POWER, 0);

    for (i = 0; i < BCM_POWER_MAXCLIENTS; i++)
        g_state.client_request[i] = BCM_POWER_NOCLIENT;
Collaborator

popcornmix commented May 21, 2012

@XECDesign
Does it boot the first time?
Do you get any console output?

@popcornmix

If by first time you mean just a standard boot, then yeah everything works just fine.

Then, without Simon's patch, when I kexec the kernel again, it says something to the effect of "Starting new kernel... bye!" and nothing happens. I haven't tried looking at the serial port output yet though. Well, I know it starts to boot, but it doesn't get very far along... not far enough to put anything on the screen.

@lp0
The patch works, thanks. However, the resolution is reset to something low, so the hdmi_mode option is ignored. Also, there are some MMC errors, but it still boots everything fine.

I'll post more info (the exact errors) once I get the serial port working.

Contributor

lp0 commented May 21, 2012

"the resolution is reset to something low, so the hdmi_mode option is ignored"

Check the parameters you're passing to bcm2708_fb on the kernel command line.

After playing around with it a bit more, yeah, it definitely works great. Haven't run into any issues yet.

Collaborator

popcornmix commented May 26, 2012

I've pushed lp0 suggestion.

popcornmix closed this May 26, 2012

Thanks! =)

asb reopened this May 27, 2012

asb commented May 27, 2012

Reopening, as ShiftPlusOne is reporting vchiq issues after kexec (details to follow)

@popcornmix , @lp0
I've run into some trouble. Video core doesn't quite work after a kexec boot.

When launching OpenELEC and hello_triangle.bin I get the following errors:
vchiq_get_state: g_state.remote->initialised != 1 (0)
vcos: [1679]: vchiq has no connection to VideoCore

I've attached full dmesg log here.
OpenELEC:
http://pastebin.com/nS5P8A7U
ArchLinux (hello_triangle):
http://pastebin.com/Bkci4ECe

Let me know if I can provide any more information.

Contributor

Ferroin commented Aug 28, 2012

kexec support is still labeled as experimental on the ARM architecture in the official release, this should really be reported (and followed for updates) there. As far as I know this is one of the big areas of focus for kernel development. Although, on the Pi it really isn't that useful from what I can tell, I clocked rebooting both through kexec and normally, and get only sub-second differences in timing.

@popcornmix popcornmix pushed a commit that referenced this issue Oct 13, 2012

Luck, Tony + H. Peter Anvin x86: Remove some noise from boot log when starting cpus
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
140f190

@popcornmix popcornmix pushed a commit that referenced this issue Oct 13, 2012

@olofj @axboe olofj + axboe block: uninitialized ioc->nr_tasks triggers WARN_ON
Hi,

I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the
below warning as of 3.5-rc1 and later (3.4 is fine):

[   10.886893] ------------[ cut here ]------------
[   10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560()
[   10.886905] Hardware name: Bochs
[   10.886906] Modules linked in:
[   10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27
[   10.886908] Call Trace:
[   10.886911]  [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0
[   10.886912]  [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20
[   10.886913]  [<ffffffff8107c088>] copy_process+0x1488/0x1560
[   10.886914]  [<ffffffff8107c244>] do_fork+0xb4/0x340
[   10.886918]  [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50
[   10.886919]  [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80
[   10.886920]  [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60
[   10.886923]  [<ffffffff81051db3>] sys_clone+0x23/0x30
[   10.886925]  [<ffffffff8179bd73>] stub_clone+0x13/0x20
[   10.886927]  [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b
[   10.886928] ---[ end trace 32a14af7ee6a590b ]---

Reproducing is easy, I can hit it on a KVM system with a very basic
config (x86_64 make defconfig + enable the drivers needed). To hit it,
just install dump (on debian/ubuntu, not sure what the package might be
called on Fedora), and:

dump -o -f /tmp/foo /

You'll see the warning in dmesg once it forks off the I/O process and
starts dumping filesystem contents.

I bisected it down to the following commit:

commit f6e8d01
Author: Tejun Heo <tj@kernel.org>
Date:   Mon Mar 5 13:15:26 2012 -0800

    block: add io_context->active_ref

    Currently ioc->nr_tasks is used to decide two things - whether an ioc
    is done issuing IOs and whether it's shared by multiple tasks.  This
    patch separate out the first into ioc->active_ref, which is acquired
    and released using {get|put}_io_context_active() respectively.

    This will be used to associate bio's with a given task.  This patch
    doesn't introduce any visible behavior change.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

It seems like the init of ioc->nr_tasks was removed in that patch,
so it starts out at 0 instead of 1.

Tejun, is the right thing here to add back the init, or should something else
be done?

The below patch removes the warning, but I haven't done any more extensive
testing on it.

Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4638a83

Hi, I'm wondering if something has broken in kexec since these messages here. I'm running the latest kernel and firmware (as of 3rd Dec 2012) and I can't get get kexec to work at all. I posted a detailed summary of the problem over on the RPi forums (here http://www.raspberrypi.org/phpBB3/viewtopic.php?f=28&t=24670&sid=bea05ce6eda0f6c50fb2c99ccbb2ab33). The quick summary of that is that kexec just gives a message about starting the new kernel, but then there's no output from the new kernel. I put in some printks to see which bits were succeeding, and it seems that we never return from machine_kexec() and that the problem lies somewhere in cpu_reset(), which is the last line executed in there.

@swarren swarren pushed a commit to swarren/linux-rpi that referenced this issue Jan 4, 2013

@broonie Inderpal Singh + broonie regulator: s5m8767: Fix probe failure due to stack corruption
The function sec_reg_read invokes regmap_read which expects unsigned int *
as the destination address. The existing driver is passing address of local
variable "val" which is u8. This causes the stack corruption and following
dump is observed during probe.

Hence change "val" from u8 to unsigned int.

Unable to handle kernel paging request at virtual address 02410020
pgd = c0004000
[02410020] *pgd=00000000
Internal error: Oops: 80000005 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0    Not tainted  (3.6.0-00696-g98a28b18-dirty #27)
PC is at 0x2410020
LR is at _regulator_get_voltage+0x3c/0x70
pc : [<02410020>]    lr : [<c02395d4>]    psr: 20000013
sp : cf839b68  ip : 00000000  fp : cf92d410
r10: 0000cfd0  r9 : c06d9878  r8 : 0000f0a0
r7 : cf839b70  r6 : cf92d400  r5 : 00000011  r4 : cf000000
r3 : 02410020  r2 : 00000000  r1 : 00000048  r0 : cf000000
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
...........................
.................................

[<c02395d4>] (_regulator_get_voltage+0x3c/0x70) from [<c023ad80>] (print_constraints+0x50/0x36c)
[<c023ad80>] (print_constraints+0x50/0x36c) from [<c023e504>] (set_machine_constraints+0xe8/0x2b0)
[<c023e504>] (set_machine_constraints+0xe8/0x2b0) from [<c023e9c8>] (regulator_register+0x2fc/0x604)
[<c023e9c8>] (regulator_register+0x2fc/0x604) from [<c049d628>] (s5m8767_pmic_probe+0x688/0x718)
[<c049d628>] (s5m8767_pmic_probe+0x688/0x718) from [<c029915c>] (platform_drv_probe+0x18/0x1c)
[<c029915c>] (platform_drv_probe+0x18/0x1c) from [<c0297dd0>] (really_probe+0x68/0x1f4)
[<c0297dd0>] (really_probe+0x68/0x1f4) from [<c0298070>] (driver_probe_device+0x30/0x48)

Signed-off-by: Inderpal Singh <inderpal.singh@linaro.org>
Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
3ef3039
Contributor

Ferroin commented Feb 14, 2013

Does kdump work? On some arches, kdump doesn't use exactly the same code as kexec, and if it works in this case it will make it much easier to figure out what is wrong with kexec.

@popcornmix popcornmix pushed a commit that referenced this issue May 1, 2013

@JoePerches @torvalds JoePerches + torvalds checkpatch: fix stringification macro defect
Fix checkpatch misreporting defect with stringification macros

ERROR: Macros with complex values should be enclosed in parenthesis
  #27: FILE: arch/arm/include/asm/kgdb.h:41:
  +#define ___to_string(X) #X

Signed-off-by: Joe Perches <joe@perches.com>
Reported-by: Vincent Stehlé <v-stehle@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
e942e2c

mkj commented Jun 10, 2013

Using a kernel built with earlyprintk it looks like the scheduler isn't switching to the kthreadd task for some reason. In the trace below the dd82a040 task is kthreadd and __schedule() is about to context_switch() to it. The "f" printk is at the bottom of finish_task_switch(), but a printk right at the very bottom of context_switch() doesn't appear (after return from finish_task_switch). Or perhaps that's normal - I'm not too familiar with scheduler code, and printk side-effects.

Any ideas where to look further as a culprit? My first guess would be something memory or cache related? Disabling I- and D- caches didn't help.

[    0.138046] Initializing cgroup subsys freezer
[    0.142711] Initializing cgroup subsys blkio
[    0.147415] CPU: Testing write buffer coherency: ok
[    0.152678] entering rest_init
[    0.156245] created kernel_init thread
[    0.160353] created kthreadd thread
[    0.164113] kthreadd_task dd82a040
[    0.167705] signalled kthreadd_done
[    0.171387] about to schedule() in schedule_preempt_disabled
[    0.177356] n dd82a040
[    0.179888] s
[    0.181614] .
[    0.183339] f

Nothing more printed.

From a28383cc5990e64e1ed95f484e222aa576585d7e Mon 3 June, built with RPi toolchain.

mkj commented Jun 10, 2013

Kexec seems to work fine booting rpi-3.9.y , booting either from that or from the 3.6 kernel. That'll teach me for not trying the newest version first!

Collaborator

popcornmix commented Jun 10, 2013

@mkj interesting.
Nothing's changed on our side that I can think of that would fix this. Perhaps there was some upstream fix.

@swarren swarren pushed a commit to swarren/linux-rpi that referenced this issue Nov 22, 2013

Borislav Petkov + Ingo Molnar x86: Improve the printout of the SMP bootup CPU table
As the new x86 CPU bootup printout format code maintainer, I am
taking immediate action to improve and clean (and thus indulge
my OCD) the reporting of the cores when coming up online.

Fix padding to a right-hand alignment, cleanup code and bind
reporting width to the max number of supported CPUs on the
system, like this:

 [    0.074509] smpboot: Booting Node   0, Processors:      #1  #2  #3  #4  #5  #6  #7 OK
 [    0.644008] smpboot: Booting Node   1, Processors:  #8  #9 #10 #11 #12 #13 #14 #15 OK
 [    1.245006] smpboot: Booting Node   2, Processors: #16 #17 #18 #19 #20 #21 #22 #23 OK
 [    1.864005] smpboot: Booting Node   3, Processors: #24 #25 #26 #27 #28 #29 #30 #31 OK
 [    2.489005] smpboot: Booting Node   4, Processors: #32 #33 #34 #35 #36 #37 #38 #39 OK
 [    3.093005] smpboot: Booting Node   5, Processors: #40 #41 #42 #43 #44 #45 #46 #47 OK
 [    3.698005] smpboot: Booting Node   6, Processors: #48 #49 #50 #51 #52 #53 #54 #55 OK
 [    4.304005] smpboot: Booting Node   7, Processors: #56 #57 #58 #59 #60 #61 #62 #63 OK
 [    4.961413] Brought up 64 CPUs

and this:

 [    0.072367] smpboot: Booting Node   0, Processors:    #1 #2 #3 #4 #5 #6 #7 OK
 [    0.686329] Brought up 8 CPUs

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Libin <huawei.libin@huawei.com>
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Link: http://lkml.kernel.org/r/20130927143554.GF4422@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
646e29a

@swarren swarren pushed a commit to swarren/linux-rpi that referenced this issue Nov 22, 2013

Borislav Petkov + Ingo Molnar x86/boot: Further compress CPUs bootup message
Turn it into (for example):

[    0.073380] x86: Booting SMP configuration:
[    0.074005] .... node   #0, CPUs:          #1   #2   #3   #4   #5   #6   #7
[    0.603005] .... node   #1, CPUs:     #8   #9  #10  #11  #12  #13  #14  #15
[    1.200005] .... node   #2, CPUs:    #16  #17  #18  #19  #20  #21  #22  #23
[    1.796005] .... node   #3, CPUs:    #24  #25  #26  #27  #28  #29  #30  #31
[    2.393005] .... node   #4, CPUs:    #32  #33  #34  #35  #36  #37  #38  #39
[    2.996005] .... node   #5, CPUs:    #40  #41  #42  #43  #44  #45  #46  #47
[    3.600005] .... node   #6, CPUs:    #48  #49  #50  #51  #52  #53  #54  #55
[    4.202005] .... node   #7, CPUs:    #56  #57  #58  #59  #60  #61  #62  #63
[    4.811005] .... node   #8, CPUs:    #64  #65  #66  #67  #68  #69  #70  #71
[    5.421006] .... node   #9, CPUs:    #72  #73  #74  #75  #76  #77  #78  #79
[    6.032005] .... node  #10, CPUs:    #80  #81  #82  #83  #84  #85  #86  #87
[    6.648006] .... node  #11, CPUs:    #88  #89  #90  #91  #92  #93  #94  #95
[    7.262005] .... node  #12, CPUs:    #96  #97  #98  #99 #100 #101 #102 #103
[    7.865005] .... node  #13, CPUs:   #104 #105 #106 #107 #108 #109 #110 #111
[    8.466005] .... node  #14, CPUs:   #112 #113 #114 #115 #116 #117 #118 #119
[    9.073006] .... node  #15, CPUs:   #120 #121 #122 #123 #124 #125 #126 #127
[    9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version which he went at great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: huawei.libin@huawei.com
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130930095624.GB16383@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
a17bce4

Hey i would like to bump this thread,

Still having trouble with a kexec't kernel:

Kernel Version:
Linux version 3.14.6+ (root@dev-vm) (gcc version 4.8.3 (Buildroot 2014.08-rc2-00028-gf67e7ba) )

dmesg after
sudo modprobe bcm2835_v4l2
http://pastebin.com/H0f4T5tJ

Modprobe is working without issues if the kernel is executed directly.
Any idears how to fix this?

Thank you :)

@popcornmix popcornmix pushed a commit that referenced this issue Oct 8, 2014

@newbg @davem330 newbg + davem330 bonding: fix div by zero while enslaving and transmitting
The problem is that the slave is first linked and slave_cnt is
incremented afterwards leading to a div by zero in the modes that use it
as a modulus. What happens is that in bond_start_xmit()
bond_has_slaves() is used to evaluate further transmission and it becomes
true after the slave is linked in, but when slave_cnt is used in the xmit
path it is still 0, so fetch it once and transmit based on that. Since
it is used only in round-robin and XOR modes, the fix is only for them.
Thanks to Eric Dumazet for pointing out the fault in my first try to fix
this.

Call trace (took it out of net-next kernel, but it's the same with net):
[46934.330038] divide error: 0000 [#1] SMP
[46934.330041] Modules linked in: bonding(O) 9p fscache
snd_hda_codec_generic crct10dif_pclmul
[46934.330041] bond0: Enslaving eth1 as an active interface with an up
link
[46934.330051]  ppdev joydev crc32_pclmul crc32c_intel 9pnet_virtio
ghash_clmulni_intel snd_hda_intel 9pnet snd_hda_controller parport_pc
serio_raw pcspkr snd_hda_codec parport virtio_balloon virtio_console
snd_hwdep snd_pcm pvpanic i2c_piix4 snd_timer i2ccore snd soundcore
virtio_blk virtio_net virtio_pci virtio_ring virtio ata_generic
pata_acpi floppy [last unloaded: bonding]
[46934.330053] CPU: 1 PID: 3382 Comm: ping Tainted: G           O
3.17.0-rc4+ #27
[46934.330053] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[46934.330054] task: ffff88005aebf2c0 ti: ffff88005b728000 task.ti:
ffff88005b728000
[46934.330059] RIP: 0010:[<ffffffffa0198c33>]  [<ffffffffa0198c33>]
bond_start_xmit+0x1c3/0x450 [bonding]
[46934.330060] RSP: 0018:ffff88005b72b7f8  EFLAGS: 00010246
[46934.330060] RAX: 0000000000000679 RBX: ffff88004b077000 RCX:
000000000000002a
[46934.330061] RDX: 0000000000000000 RSI: ffff88004b3f0500 RDI:
ffff88004b077940
[46934.330061] RBP: ffff88005b72b830 R08: 00000000000000c0 R09:
ffff88004a83e000
[46934.330062] R10: 000000000000ffff R11: ffff88004b1f12c0 R12:
ffff88004b3f0500
[46934.330062] R13: ffff88004b3f0500 R14: 000000000000002a R15:
ffff88004b077940
[46934.330063] FS:  00007fbd91a4c740(0000) GS:ffff88005f080000(0000)
knlGS:0000000000000000
[46934.330064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46934.330064] CR2: 00007f803a8bb000 CR3: 000000004b2c9000 CR4:
00000000000406e0
[46934.330069] Stack:
[46934.330071]  ffffffff811e6169 00000000e772fa05 ffff88004b077000
ffff88004b3f0500
[46934.330072]  ffffffff81d17d18 000000000000002a 0000000000000000
ffff88005b72b8a0
[46934.330073]  ffffffff81620108 ffffffff8161fe0e ffff88005b72b8c4
ffff88005b302000
[46934.330073] Call Trace:
[46934.330077]  [<ffffffff811e6169>] ?
__kmalloc_node_track_caller+0x119/0x300
[46934.330084]  [<ffffffff81620108>] dev_hard_start_xmit+0x188/0x410
[46934.330086]  [<ffffffff8161fe0e>] ? harmonize_features+0x2e/0x90
[46934.330088]  [<ffffffff81620b06>] __dev_queue_xmit+0x456/0x590
[46934.330089]  [<ffffffff81620c50>] dev_queue_xmit+0x10/0x20
[46934.330090]  [<ffffffff8168f022>] arp_xmit+0x22/0x60
[46934.330091]  [<ffffffff8168f090>] arp_send.part.16+0x30/0x40
[46934.330092]  [<ffffffff8168f1e5>] arp_solicit+0x115/0x2b0
[46934.330094]  [<ffffffff8160b5d7>] ? copy_skb_header+0x17/0xa0
[46934.330096]  [<ffffffff8162875a>] neigh_probe+0x4a/0x70
[46934.330097]  [<ffffffff8162979c>] __neigh_event_send+0xac/0x230
[46934.330098]  [<ffffffff8162a00b>] neigh_resolve_output+0x13b/0x220
[46934.330100]  [<ffffffff8165f120>] ? ip_forward_options+0x1c0/0x1c0
[46934.330101]  [<ffffffff81660478>] ip_finish_output+0x1f8/0x860
[46934.330102]  [<ffffffff81661f08>] ip_output+0x58/0x90
[46934.330103]  [<ffffffff81661602>] ? __ip_local_out+0xa2/0xb0
[46934.330104]  [<ffffffff81661640>] ip_local_out_sk+0x30/0x40
[46934.330105]  [<ffffffff81662a66>] ip_send_skb+0x16/0x50
[46934.330106]  [<ffffffff81662ad3>] ip_push_pending_frames+0x33/0x40
[46934.330107]  [<ffffffff8168854c>] raw_sendmsg+0x88c/0xa30
[46934.330110]  [<ffffffff81612b31>] ? skb_recv_datagram+0x41/0x60
[46934.330111]  [<ffffffff816875a9>] ? raw_recvmsg+0xa9/0x1f0
[46934.330113]  [<ffffffff816978d4>] inet_sendmsg+0x74/0xc0
[46934.330114]  [<ffffffff81697a9b>] ? inet_recvmsg+0x8b/0xb0
[46934.330115] bond0: Adding slave eth2
[46934.330116]  [<ffffffff8160357c>] sock_sendmsg+0x9c/0xe0
[46934.330118]  [<ffffffff81603248>] ?
move_addr_to_kernel.part.20+0x28/0x80
[46934.330121]  [<ffffffff811b4477>] ? might_fault+0x47/0x50
[46934.330122]  [<ffffffff816039b9>] ___sys_sendmsg+0x3a9/0x3c0
[46934.330125]  [<ffffffff8144a14a>] ? n_tty_write+0x3aa/0x530
[46934.330127]  [<ffffffff810d1ae4>] ? __wake_up+0x44/0x50
[46934.330129]  [<ffffffff81242b38>] ? fsnotify+0x238/0x310
[46934.330130]  [<ffffffff816048a1>] __sys_sendmsg+0x51/0x90
[46934.330131]  [<ffffffff816048f2>] SyS_sendmsg+0x12/0x20
[46934.330134]  [<ffffffff81738b29>] system_call_fastpath+0x16/0x1b
[46934.330144] Code: 48 8b 10 4c 89 ee 4c 89 ff e8 aa bc ff ff 31 c0 e9
1a ff ff ff 0f 1f 00 4c 89 ee 4c 89 ff e8 65 fb ff ff 31 d2 4c 89 ee 4c
89 ff <f7> b3 64 09 00 00 e8 02 bd ff ff 31 c0 e9 f2 fe ff ff 0f 1f 00
[46934.330146] RIP  [<ffffffffa0198c33>] bond_start_xmit+0x1c3/0x450
[bonding]
[46934.330146]  RSP <ffff88005b72b7f8>

CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
Fixes: 278b208 ("bonding: initial RCU conversion")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9a72c2d

@swarren swarren pushed a commit to swarren/linux-rpi that referenced this issue Oct 17, 2014

Dave Hansen + Ingo Molnar x86, sched: Add new topology for multi-NUMA-node CPUs
I'm getting the spew below when booting with Haswell (Xeon
E5-2699 v3) CPUs and the "Cluster-on-Die" (CoD) feature enabled
in the BIOS.  It seems similar to the issue that some folks from
AMD ran in to on their systems and addressed in this commit:

  161270f ("x86/smp: Fix topology checks on AMD MCM CPUs")

Both these Intel and AMD systems break an assumption which is
being enforced by topology_sane(): a socket may not contain more
than one NUMA node.

AMD special-cased their system by looking for a cpuid flag.  The
Intel mode is dependent on BIOS options and I do not know of a
way which it is enumerated other than the tables being parsed
during the CPU bringup process.  In other words, we have to trust
the ACPI tables <shudder>.

This detects the situation where a NUMA node occurs at a place in
the middle of the "CPU" sched domains.  It replaces the default
topology with one that relies on the NUMA information from the
firmware (SRAT table) for all levels of sched domains above the
hyperthreads.

This also fixes a sysfs bug.  We used to freak out when we saw
the "mc" group cross a node boundary, so we stopped building the
MC group.  MC gets exported as the 'core_siblings_list' in
/sys/devices/system/cpu/cpu*/topology/ and this caused CPUs with
the same 'physical_package_id' to not be listed together in
'core_siblings_list'.  This violates a statement from
Documentation/ABI/testing/sysfs-devices-system-cpu:

	core_siblings: internal kernel map of cpu#'s hardware threads
	within the same physical_package_id.

	core_siblings_list: human-readable list of the logical CPU
	numbers within the same physical_package_id as cpu#.

The sysfs effects here cause an issue with the hwloc tool where
it gets confused and thinks there are more sockets than are
physically present.

Before this patch, there are two packages:

# cd /sys/devices/system/cpu/
# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1

But 4 _sets_ of core siblings:

# cat cpu*/topology/core_siblings_list | sort | uniq -c
      9 0-8
      9 18-26
      9 27-35
      9 9-17

After this set, there are only 2 sets of core siblings, which
is what we expect for a 2-socket system.

# cat cpu*/topology/physical_package_id | sort | uniq -c
     18 0
     18 1
# cat cpu*/topology/core_siblings_list | sort | uniq -c
     18 0-17
     18 18-35

Example spew:
...
	NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
	 #2  #3  #4  #5  #6  #7  #8
	.... node  #1, CPUs:    #9
	------------[ cut here ]------------
	WARNING: CPU: 9 PID: 0 at /home/ak/hle/linux-hle-2.6/arch/x86/kernel/smpboot.c:306 topology_sane.isra.2+0x74/0x90()
	sched: CPU #9's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
	Modules linked in:
	CPU: 9 PID: 0 Comm: swapper/9 Not tainted 3.17.0-rc1-00293-g8e01c4d-dirty #631
	Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS GRNDSDP1.86B.0036.R05.1407140519 07/14/2014
	0000000000000009 ffff88046ddabe00 ffffffff8172e485 ffff88046ddabe48
	ffff88046ddabe38 ffffffff8109691d 000000000000b001 0000000000000009
	ffff88086fc12580 000000000000b020 0000000000000009 ffff88046ddabe98
	Call Trace:
	[<ffffffff8172e485>] dump_stack+0x45/0x56
	[<ffffffff8109691d>] warn_slowpath_common+0x7d/0xa0
	[<ffffffff8109698c>] warn_slowpath_fmt+0x4c/0x50
	[<ffffffff81074f94>] topology_sane.isra.2+0x74/0x90
	[<ffffffff8107530e>] set_cpu_sibling_map+0x31e/0x4f0
	[<ffffffff8107568d>] start_secondary+0x1ad/0x240
	---[ end trace 3fe5f587a9fcde61 ]---
	#10 #11 #12 #13 #14 #15 #16 #17
	.... node  #2, CPUs:   #18 #19 #20 #21 #22 #23 #24 #25 #26
	.... node  #3, CPUs:   #27 #28 #29 #30 #31 #32 #33 #34 #35

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
[ Added LLC domain and s/match_mc/match_die/ ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: brice.goglin@gmail.com
Cc: "H. Peter Anvin" <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/20140918193334.C065EBCE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cebf15e

@swarren swarren pushed a commit to swarren/linux-rpi that referenced this issue Nov 5, 2014

@pranith @mpe pranith + mpe powerpc: Wire up sys_bpf() syscall
This patch wires up the new syscall sys_bpf() on powerpc.

Passes the tests in samples/bpf:

    #0 add+sub+mul OK
    #1 unreachable OK
    #2 unreachable2 OK
    #3 out of range jump OK
    #4 out of range jump2 OK
    #5 test1 ld_imm64 OK
    #6 test2 ld_imm64 OK
    #7 test3 ld_imm64 OK
    #8 test4 ld_imm64 OK
    #9 test5 ld_imm64 OK
    #10 no bpf_exit OK
    #11 loop (back-edge) OK
    #12 loop2 (back-edge) OK
    #13 conditional loop OK
    #14 read uninitialized register OK
    #15 read invalid register OK
    #16 program doesn't init R0 before exit OK
    #17 stack out of bounds OK
    #18 invalid call insn1 OK
    #19 invalid call insn2 OK
    #20 invalid function call OK
    #21 uninitialized stack1 OK
    #22 uninitialized stack2 OK
    #23 check valid spill/fill OK
    #24 check corrupted spill/fill OK
    #25 invalid src register in STX OK
    #26 invalid dst register in STX OK
    #27 invalid dst register in ST OK
    #28 invalid src register in LDX OK
    #29 invalid dst register in LDX OK
    #30 junk insn OK
    #31 junk insn2 OK
    #32 junk insn3 OK
    #33 junk insn4 OK
    #34 junk insn5 OK
    #35 misaligned read from stack OK
    #36 invalid map_fd for function call OK
    #37 don't check return value before access OK
    #38 access memory with incorrect alignment OK
    #39 sometimes access memory with incorrect alignment OK
    #40 jump test 1 OK
    #41 jump test 2 OK
    #42 jump test 3 OK
    #43 jump test 4 OK

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
[mpe: test using samples/bpf]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
fcbb539

@popcornmix popcornmix pushed a commit that referenced this issue Nov 7, 2014

@newbg @gregkh newbg + gregkh bonding: fix div by zero while enslaving and transmitting
[ Upstream commit 9a72c2d ]

The problem is that the slave is first linked and slave_cnt is
incremented afterwards leading to a div by zero in the modes that use it
as a modulus. What happens is that in bond_start_xmit()
bond_has_slaves() is used to evaluate further transmission and it becomes
true after the slave is linked in, but when slave_cnt is used in the xmit
path it is still 0, so fetch it once and transmit based on that. Since
it is used only in round-robin and XOR modes, the fix is only for them.
Thanks to Eric Dumazet for pointing out the fault in my first try to fix
this.

Call trace (took it out of net-next kernel, but it's the same with net):
[46934.330038] divide error: 0000 [#1] SMP
[46934.330041] Modules linked in: bonding(O) 9p fscache
snd_hda_codec_generic crct10dif_pclmul
[46934.330041] bond0: Enslaving eth1 as an active interface with an up
link
[46934.330051]  ppdev joydev crc32_pclmul crc32c_intel 9pnet_virtio
ghash_clmulni_intel snd_hda_intel 9pnet snd_hda_controller parport_pc
serio_raw pcspkr snd_hda_codec parport virtio_balloon virtio_console
snd_hwdep snd_pcm pvpanic i2c_piix4 snd_timer i2ccore snd soundcore
virtio_blk virtio_net virtio_pci virtio_ring virtio ata_generic
pata_acpi floppy [last unloaded: bonding]
[46934.330053] CPU: 1 PID: 3382 Comm: ping Tainted: G           O
3.17.0-rc4+ #27
[46934.330053] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[46934.330054] task: ffff88005aebf2c0 ti: ffff88005b728000 task.ti:
ffff88005b728000
[46934.330059] RIP: 0010:[<ffffffffa0198c33>]  [<ffffffffa0198c33>]
bond_start_xmit+0x1c3/0x450 [bonding]
[46934.330060] RSP: 0018:ffff88005b72b7f8  EFLAGS: 00010246
[46934.330060] RAX: 0000000000000679 RBX: ffff88004b077000 RCX:
000000000000002a
[46934.330061] RDX: 0000000000000000 RSI: ffff88004b3f0500 RDI:
ffff88004b077940
[46934.330061] RBP: ffff88005b72b830 R08: 00000000000000c0 R09:
ffff88004a83e000
[46934.330062] R10: 000000000000ffff R11: ffff88004b1f12c0 R12:
ffff88004b3f0500
[46934.330062] R13: ffff88004b3f0500 R14: 000000000000002a R15:
ffff88004b077940
[46934.330063] FS:  00007fbd91a4c740(0000) GS:ffff88005f080000(0000)
knlGS:0000000000000000
[46934.330064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46934.330064] CR2: 00007f803a8bb000 CR3: 000000004b2c9000 CR4:
00000000000406e0
[46934.330069] Stack:
[46934.330071]  ffffffff811e6169 00000000e772fa05 ffff88004b077000
ffff88004b3f0500
[46934.330072]  ffffffff81d17d18 000000000000002a 0000000000000000
ffff88005b72b8a0
[46934.330073]  ffffffff81620108 ffffffff8161fe0e ffff88005b72b8c4
ffff88005b302000
[46934.330073] Call Trace:
[46934.330077]  [<ffffffff811e6169>] ?
__kmalloc_node_track_caller+0x119/0x300
[46934.330084]  [<ffffffff81620108>] dev_hard_start_xmit+0x188/0x410
[46934.330086]  [<ffffffff8161fe0e>] ? harmonize_features+0x2e/0x90
[46934.330088]  [<ffffffff81620b06>] __dev_queue_xmit+0x456/0x590
[46934.330089]  [<ffffffff81620c50>] dev_queue_xmit+0x10/0x20
[46934.330090]  [<ffffffff8168f022>] arp_xmit+0x22/0x60
[46934.330091]  [<ffffffff8168f090>] arp_send.part.16+0x30/0x40
[46934.330092]  [<ffffffff8168f1e5>] arp_solicit+0x115/0x2b0
[46934.330094]  [<ffffffff8160b5d7>] ? copy_skb_header+0x17/0xa0
[46934.330096]  [<ffffffff8162875a>] neigh_probe+0x4a/0x70
[46934.330097]  [<ffffffff8162979c>] __neigh_event_send+0xac/0x230
[46934.330098]  [<ffffffff8162a00b>] neigh_resolve_output+0x13b/0x220
[46934.330100]  [<ffffffff8165f120>] ? ip_forward_options+0x1c0/0x1c0
[46934.330101]  [<ffffffff81660478>] ip_finish_output+0x1f8/0x860
[46934.330102]  [<ffffffff81661f08>] ip_output+0x58/0x90
[46934.330103]  [<ffffffff81661602>] ? __ip_local_out+0xa2/0xb0
[46934.330104]  [<ffffffff81661640>] ip_local_out_sk+0x30/0x40
[46934.330105]  [<ffffffff81662a66>] ip_send_skb+0x16/0x50
[46934.330106]  [<ffffffff81662ad3>] ip_push_pending_frames+0x33/0x40
[46934.330107]  [<ffffffff8168854c>] raw_sendmsg+0x88c/0xa30
[46934.330110]  [<ffffffff81612b31>] ? skb_recv_datagram+0x41/0x60
[46934.330111]  [<ffffffff816875a9>] ? raw_recvmsg+0xa9/0x1f0
[46934.330113]  [<ffffffff816978d4>] inet_sendmsg+0x74/0xc0
[46934.330114]  [<ffffffff81697a9b>] ? inet_recvmsg+0x8b/0xb0
[46934.330115] bond0: Adding slave eth2
[46934.330116]  [<ffffffff8160357c>] sock_sendmsg+0x9c/0xe0
[46934.330118]  [<ffffffff81603248>] ?
move_addr_to_kernel.part.20+0x28/0x80
[46934.330121]  [<ffffffff811b4477>] ? might_fault+0x47/0x50
[46934.330122]  [<ffffffff816039b9>] ___sys_sendmsg+0x3a9/0x3c0
[46934.330125]  [<ffffffff8144a14a>] ? n_tty_write+0x3aa/0x530
[46934.330127]  [<ffffffff810d1ae4>] ? __wake_up+0x44/0x50
[46934.330129]  [<ffffffff81242b38>] ? fsnotify+0x238/0x310
[46934.330130]  [<ffffffff816048a1>] __sys_sendmsg+0x51/0x90
[46934.330131]  [<ffffffff816048f2>] SyS_sendmsg+0x12/0x20
[46934.330134]  [<ffffffff81738b29>] system_call_fastpath+0x16/0x1b
[46934.330144] Code: 48 8b 10 4c 89 ee 4c 89 ff e8 aa bc ff ff 31 c0 e9
1a ff ff ff 0f 1f 00 4c 89 ee 4c 89 ff e8 65 fb ff ff 31 d2 4c 89 ee 4c
89 ff <f7> b3 64 09 00 00 e8 02 bd ff ff 31 c0 e9 f2 fe ff ff 0f 1f 00
[46934.330146] RIP  [<ffffffffa0198c33>] bond_start_xmit+0x1c3/0x450
[bonding]
[46934.330146]  RSP <ffff88005b72b7f8>

CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
Fixes: 278b208 ("bonding: initial RCU conversion")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9ede8fd

@davet321 davet321 pushed a commit to davet321/rpi-linux that referenced this issue Dec 17, 2014

@newbg @gregkh newbg + gregkh bonding: fix div by zero while enslaving and transmitting
[ Upstream commit 9a72c2d ]

The problem is that the slave is first linked and slave_cnt is
incremented afterwards leading to a div by zero in the modes that use it
as a modulus. What happens is that in bond_start_xmit()
bond_has_slaves() is used to evaluate further transmission and it becomes
true after the slave is linked in, but when slave_cnt is used in the xmit
path it is still 0, so fetch it once and transmit based on that. Since
it is used only in round-robin and XOR modes, the fix is only for them.
Thanks to Eric Dumazet for pointing out the fault in my first try to fix
this.

Call trace (took it out of net-next kernel, but it's the same with net):
[46934.330038] divide error: 0000 [#1] SMP
[46934.330041] Modules linked in: bonding(O) 9p fscache
snd_hda_codec_generic crct10dif_pclmul
[46934.330041] bond0: Enslaving eth1 as an active interface with an up
link
[46934.330051]  ppdev joydev crc32_pclmul crc32c_intel 9pnet_virtio
ghash_clmulni_intel snd_hda_intel 9pnet snd_hda_controller parport_pc
serio_raw pcspkr snd_hda_codec parport virtio_balloon virtio_console
snd_hwdep snd_pcm pvpanic i2c_piix4 snd_timer i2ccore snd soundcore
virtio_blk virtio_net virtio_pci virtio_ring virtio ata_generic
pata_acpi floppy [last unloaded: bonding]
[46934.330053] CPU: 1 PID: 3382 Comm: ping Tainted: G           O
3.17.0-rc4+ #27
[46934.330053] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[46934.330054] task: ffff88005aebf2c0 ti: ffff88005b728000 task.ti:
ffff88005b728000
[46934.330059] RIP: 0010:[<ffffffffa0198c33>]  [<ffffffffa0198c33>]
bond_start_xmit+0x1c3/0x450 [bonding]
[46934.330060] RSP: 0018:ffff88005b72b7f8  EFLAGS: 00010246
[46934.330060] RAX: 0000000000000679 RBX: ffff88004b077000 RCX:
000000000000002a
[46934.330061] RDX: 0000000000000000 RSI: ffff88004b3f0500 RDI:
ffff88004b077940
[46934.330061] RBP: ffff88005b72b830 R08: 00000000000000c0 R09:
ffff88004a83e000
[46934.330062] R10: 000000000000ffff R11: ffff88004b1f12c0 R12:
ffff88004b3f0500
[46934.330062] R13: ffff88004b3f0500 R14: 000000000000002a R15:
ffff88004b077940
[46934.330063] FS:  00007fbd91a4c740(0000) GS:ffff88005f080000(0000)
knlGS:0000000000000000
[46934.330064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[46934.330064] CR2: 00007f803a8bb000 CR3: 000000004b2c9000 CR4:
00000000000406e0
[46934.330069] Stack:
[46934.330071]  ffffffff811e6169 00000000e772fa05 ffff88004b077000
ffff88004b3f0500
[46934.330072]  ffffffff81d17d18 000000000000002a 0000000000000000
ffff88005b72b8a0
[46934.330073]  ffffffff81620108 ffffffff8161fe0e ffff88005b72b8c4
ffff88005b302000
[46934.330073] Call Trace:
[46934.330077]  [<ffffffff811e6169>] ?
__kmalloc_node_track_caller+0x119/0x300
[46934.330084]  [<ffffffff81620108>] dev_hard_start_xmit+0x188/0x410
[46934.330086]  [<ffffffff8161fe0e>] ? harmonize_features+0x2e/0x90
[46934.330088]  [<ffffffff81620b06>] __dev_queue_xmit+0x456/0x590
[46934.330089]  [<ffffffff81620c50>] dev_queue_xmit+0x10/0x20
[46934.330090]  [<ffffffff8168f022>] arp_xmit+0x22/0x60
[46934.330091]  [<ffffffff8168f090>] arp_send.part.16+0x30/0x40
[46934.330092]  [<ffffffff8168f1e5>] arp_solicit+0x115/0x2b0
[46934.330094]  [<ffffffff8160b5d7>] ? copy_skb_header+0x17/0xa0
[46934.330096]  [<ffffffff8162875a>] neigh_probe+0x4a/0x70
[46934.330097]  [<ffffffff8162979c>] __neigh_event_send+0xac/0x230
[46934.330098]  [<ffffffff8162a00b>] neigh_resolve_output+0x13b/0x220
[46934.330100]  [<ffffffff8165f120>] ? ip_forward_options+0x1c0/0x1c0
[46934.330101]  [<ffffffff81660478>] ip_finish_output+0x1f8/0x860
[46934.330102]  [<ffffffff81661f08>] ip_output+0x58/0x90
[46934.330103]  [<ffffffff81661602>] ? __ip_local_out+0xa2/0xb0
[46934.330104]  [<ffffffff81661640>] ip_local_out_sk+0x30/0x40
[46934.330105]  [<ffffffff81662a66>] ip_send_skb+0x16/0x50
[46934.330106]  [<ffffffff81662ad3>] ip_push_pending_frames+0x33/0x40
[46934.330107]  [<ffffffff8168854c>] raw_sendmsg+0x88c/0xa30
[46934.330110]  [<ffffffff81612b31>] ? skb_recv_datagram+0x41/0x60
[46934.330111]  [<ffffffff816875a9>] ? raw_recvmsg+0xa9/0x1f0
[46934.330113]  [<ffffffff816978d4>] inet_sendmsg+0x74/0xc0
[46934.330114]  [<ffffffff81697a9b>] ? inet_recvmsg+0x8b/0xb0
[46934.330115] bond0: Adding slave eth2
[46934.330116]  [<ffffffff8160357c>] sock_sendmsg+0x9c/0xe0
[46934.330118]  [<ffffffff81603248>] ?
move_addr_to_kernel.part.20+0x28/0x80
[46934.330121]  [<ffffffff811b4477>] ? might_fault+0x47/0x50
[46934.330122]  [<ffffffff816039b9>] ___sys_sendmsg+0x3a9/0x3c0
[46934.330125]  [<ffffffff8144a14a>] ? n_tty_write+0x3aa/0x530
[46934.330127]  [<ffffffff810d1ae4>] ? __wake_up+0x44/0x50
[46934.330129]  [<ffffffff81242b38>] ? fsnotify+0x238/0x310
[46934.330130]  [<ffffffff816048a1>] __sys_sendmsg+0x51/0x90
[46934.330131]  [<ffffffff816048f2>] SyS_sendmsg+0x12/0x20
[46934.330134]  [<ffffffff81738b29>] system_call_fastpath+0x16/0x1b
[46934.330144] Code: 48 8b 10 4c 89 ee 4c 89 ff e8 aa bc ff ff 31 c0 e9
1a ff ff ff 0f 1f 00 4c 89 ee 4c 89 ff e8 65 fb ff ff 31 d2 4c 89 ee 4c
89 ff <f7> b3 64 09 00 00 e8 02 bd ff ff 31 c0 e9 f2 fe ff ff 0f 1f 00
[46934.330146] RIP  [<ffffffffa0198c33>] bond_start_xmit+0x1c3/0x450
[bonding]
[46934.330146]  RSP <ffff88005b72b7f8>

CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
Fixes: 278b208 ("bonding: initial RCU conversion")
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
e39b8b0

@davet321 davet321 pushed a commit to davet321/rpi-linux that referenced this issue Aug 17, 2015

@gregkh Michal Hocko + gregkh mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
commit ecf5fc6 upstream.

Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
7f488aa

@popcornmix popcornmix pushed a commit that referenced this issue Aug 20, 2015

@torvalds Michal Hocko + torvalds mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ecf5fc6

@popcornmix popcornmix pushed a commit that referenced this issue Aug 20, 2015

@torvalds Wanpeng Li + torvalds mm/hwpoison: fix panic due to split huge zero page
Bug:

  ------------[ cut here ]------------
  kernel BUG at mm/huge_memory.c:1957!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
  CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
  Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
  task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
  RIP: split_huge_page_to_list+0xdb/0x120
  Call Trace:
    memory_failure+0x32e/0x7c0
    madvise_hwpoison+0x8b/0x160
    SyS_madvise+0x40/0x240
    ? do_page_fault+0x37/0x90
    entry_SYSCALL_64_fastpath+0x12/0x71
  Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
  RIP  split_huge_page_to_list+0xdb/0x120
   RSP <ffff8800db16fde8>
  ---[ end trace aee7ce0df8e44076 ]---

Testcase:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <string.h>

    #define MB 1024*1024

    int main(void)
    {
            char *mem;

            posix_memalign((void **)&mem, 2 * MB, 200 * MB);

            madvise(mem, 200 * MB, MADV_HWPOISON);

            free(mem);

            return 0;
    }

Huge zero page is allocated if page fault w/o FAULT_FLAG_WRITE flag.
The get_user_pages_fast() which called in madvise_hwpoison() will get
huge zero page if the page is not allocated before.  Huge zero page is a
tranparent huge page, however, it is not an anonymous page.
memory_failure will split the huge zero page and trigger
BUG_ON(is_huge_zero_page(page));

After commit 98ed2b0 ("mm/memory-failure: give up error handling
for non-tail-refcounted thp"), memory_failure will not catch non anon
thp from madvise_hwpoison path and this bug occur.

Fix it by catching non anon thp in memory_failure in order to not split
huge zero page in madvise_hwpoison path.

After this patch:

  Injecting memory failure for page 0x202800 at 0x7fd8ae800000
  MCE: 0x202800: non anonymous thp
  [...]

[akpm@linux-foundation.org: remove second split, per Wanpeng]
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
7f6bf39

@popcornmix popcornmix pushed a commit that referenced this issue Nov 28, 2016

@rkrcmar rkrcmar KVM: x86: check for pic and ioapic presence before use
Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 49df639 ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
df49289

@popcornmix popcornmix pushed a commit that referenced this issue Dec 2, 2016

@rkrcmar @gregkh rkrcmar + gregkh KVM: x86: check for pic and ioapic presence before use
commit df49289 upstream.

Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 49df639 ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
32fe669

@popcornmix popcornmix pushed a commit that referenced this issue Dec 2, 2016

@rkrcmar @gregkh rkrcmar + gregkh KVM: x86: check for pic and ioapic presence before use
commit df49289 upstream.

Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 49df639 ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
341f973

XECDesign referenced this issue in Hexxeh/rpi-firmware Jan 10, 2017

Closed

Kernel 4.8.13 kernel.img format changed. #132

@popcornmix popcornmix pushed a commit that referenced this issue Feb 6, 2017

@rabinv @smfrench rabinv + smfrench cifs: initialize file_info_lock
Reviewed-by: Jeff Layton <jlayton@redhat.com>
CC: Stable <stable@vger.kernel.org>

file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":

 BUG: spinlock bad magic on CPU#0, ls/486
  lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
 CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
  ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
  ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
 Call Trace:
  [<ffffffff81327533>] dump_stack+0x65/0x92
  [<ffffffff810baf75>] spin_dump+0x85/0xe0
  [<ffffffff810baff6>] spin_bug+0x26/0x30
  [<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
  [<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
  [<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
  [<ffffffff81181cfd>] __fput+0x5d/0x160
  [<ffffffff81181e3e>] ____fput+0xe/0x10
  [<ffffffff8109410e>] task_work_run+0x7e/0xa0
  [<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
  [<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
  [<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9

Fixes: 3afca26 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
81ddd8c

@popcornmix popcornmix pushed a commit that referenced this issue Feb 10, 2017

@rabinv @gregkh rabinv + gregkh cifs: initialize file_info_lock
commit 81ddd8c upstream.

Reviewed-by: Jeff Layton <jlayton@redhat.com>

file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":

 BUG: spinlock bad magic on CPU#0, ls/486
  lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
 CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
  ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
  ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
 Call Trace:
  [<ffffffff81327533>] dump_stack+0x65/0x92
  [<ffffffff810baf75>] spin_dump+0x85/0xe0
  [<ffffffff810baff6>] spin_bug+0x26/0x30
  [<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
  [<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
  [<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
  [<ffffffff81181cfd>] __fput+0x5d/0x160
  [<ffffffff81181e3e>] ____fput+0xe/0x10
  [<ffffffff8109410e>] task_work_run+0x7e/0xa0
  [<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
  [<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
  [<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9

Fixes: 3afca26 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
920bba1

@popcornmix popcornmix pushed a commit that referenced this issue Feb 10, 2017

@rabinv @gregkh rabinv + gregkh cifs: initialize file_info_lock
commit 81ddd8c upstream.

Reviewed-by: Jeff Layton <jlayton@redhat.com>

file_info_lock is not initalized in initiate_cifs_search(), leading to the
following splat after a simple "mount.cifs ... dir && ls dir/":

 BUG: spinlock bad magic on CPU#0, ls/486
  lock: 0xffff880009301110, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
 CPU: 0 PID: 486 Comm: ls Not tainted 4.9.0 #27
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
  ffffc900042f3db0 ffffffff81327533 0000000000000000 ffff880009301110
  ffffc900042f3dd0 ffffffff810baf75 ffff880009301110 ffffffff817ae077
  ffffc900042f3df0 ffffffff810baff6 ffff880009301110 ffff880008d69900
 Call Trace:
  [<ffffffff81327533>] dump_stack+0x65/0x92
  [<ffffffff810baf75>] spin_dump+0x85/0xe0
  [<ffffffff810baff6>] spin_bug+0x26/0x30
  [<ffffffff810bb159>] do_raw_spin_lock+0xe9/0x130
  [<ffffffff8159ad2f>] _raw_spin_lock+0x1f/0x30
  [<ffffffff8127e50d>] cifs_closedir+0x4d/0x100
  [<ffffffff81181cfd>] __fput+0x5d/0x160
  [<ffffffff81181e3e>] ____fput+0xe/0x10
  [<ffffffff8109410e>] task_work_run+0x7e/0xa0
  [<ffffffff81002512>] exit_to_usermode_loop+0x92/0xa0
  [<ffffffff810026f9>] syscall_return_slowpath+0x49/0x50
  [<ffffffff8159b484>] entry_SYSCALL_64_fastpath+0xa7/0xa9

Fixes: 3afca26 ("Clarify locking of cifs file and tcon structures and make more granular")
Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
9e25599
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment