Bootloader UART setup #18

Closed
bootc opened this Issue May 9, 2012 · 10 comments

Comments

Projects
None yet
4 participants
Contributor

bootc commented May 9, 2012

Could the bootloader set up the PL011 UART before it hands over to the kernel? Currently I have code in bootc/linux@a91fe13 that sets up the serial port during the kernel decompression stage so that it can be used during decompression and then during early printk.

It would really be ideal if that was done outside the kernel as is more customary, and would mean you could boot compressed or uncompressed kernels and get the same behaviour.

Collaborator

popcornmix commented May 9, 2012

You would like the code inside your arch_decomp_setup (5 peripheral writes) to be done by GPU before kernel starts?
Sounds okay.

Contributor

bootc commented May 9, 2012

Yep, just so the UART is good to go. Perhaps even have a config.txt option to change the baud rate if someone wants a different speed.

Collaborator

popcornmix commented May 11, 2012

This link is no longer valid:
bootc/linux@a91fe13

I'd quite like to see the code in that commit.

Contributor

bootc commented May 12, 2012

Ahh sorry I had a bit of a reorganisation on github so it's moved to:
bootc/linux-rpi-orig@a91fe13

Contributor

bootc commented May 12, 2012

I should add, I will happily provide you with the generated ARM assembly if that's any help.

Collaborator

popcornmix commented May 15, 2012

Can you try:
http://dl.dropbox.com/u/3669512/start.elf

which should set up UART to 115200 with a 3MHz clock as requested elsewhere. You can modify these:
CONFIG(init_uart_baud)
CONFIG(init_uart_clock)

Contributor

nomis commented May 15, 2012

This is the same start.elf from #16 and writes to the UART without prior initialisation by the kernel worked too.

Collaborator

popcornmix commented May 16, 2012

Firmware now pushed to github. Please close issue if happy.

Contributor

bootc commented May 18, 2012

I haven't actually managed to test this myself yet, but Simon has and assures me it's fine. So I'll close it for now and reopen it if I run into any trouble.

Thanks!
Chris

@bootc bootc closed this May 18, 2012

PeterHuewe pushed a commit to PeterHuewe/linux that referenced this issue May 18, 2012

mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGE…
…S block during isolation for migration

When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned.  Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash.  This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 #1 [d72d3b24] oops_end at c05c5322
 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 #3 [d72d3bec] bad_area at c0227fb6
 #4 [d72d3c00] do_page_fault at c05c72ec
 #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
    DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
    CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
 #6 [d72d3cb4] isolate_migratepages at c030b15a
 #7 [d72d3d14] zone_watermark_ok at c02d26cb
 #8 [d72d3d2c] compact_zone at c030b8de
 #9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff00  EBX: 00000001  ECX: 00001600  EDX: 00000431
    DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
    SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
    CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned.  Lets say we have a case
like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole.  When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

PeterHuewe pushed a commit to PeterHuewe/linux that referenced this issue May 18, 2012

ixgbe: do not update real num queues when netdev is going away
If the netdev is already in NETREG_UNREGISTERING/_UNREGISTERED state, do not
update the real num tx queues. netdev_queue_update_kobjects() is already
called via remove_queue_kobjects() at NETREG_UNREGISTERING time. So, when
upper layer driver, e.g., FCoE protocol stack is monitoring the netdev
event of NETDEV_UNREGISTER and calls back to LLD ndo_fcoe_disable() to remove
extra queues allocated for FCoE, the associated txq sysfs kobjects are already
removed, and trying to update the real num queues would cause something like
below:

...
PID: 25138  TASK: ffff88021e64c440  CPU: 3   COMMAND: "kworker/3:3"
 #0 [ffff88021f007760] machine_kexec at ffffffff810226d9
 #1 [ffff88021f0077d0] crash_kexec at ffffffff81089d2d
 #2 [ffff88021f0078a0] oops_end at ffffffff813bca78
 #3 [ffff88021f0078d0] no_context at ffffffff81029e72
 #4 [ffff88021f007920] __bad_area_nosemaphore at ffffffff8102a155
 #5 [ffff88021f0079f0] bad_area_nosemaphore at ffffffff8102a23e
 #6 [ffff88021f007a00] do_page_fault at ffffffff813bf32e
 #7 [ffff88021f007b10] page_fault at ffffffff813bc045
    [exception RIP: sysfs_find_dirent+17]
    RIP: ffffffff81178611  RSP: ffff88021f007bc0  RFLAGS: 00010246
    RAX: ffff88021e64c440  RBX: ffffffff8156cc63  RCX: 0000000000000004
    RDX: ffffffff8156cc63  RSI: 0000000000000000  RDI: 0000000000000000
    RBP: ffff88021f007be0   R8: 0000000000000004   R9: 0000000000000008
    R10: ffffffff816fed00  R11: 0000000000000004  R12: 0000000000000000
    R13: ffffffff8156cc63  R14: 0000000000000000  R15: ffff8802222a0000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #8 [ffff88021f007be8] sysfs_get_dirent at ffffffff81178c07
 #9 [ffff88021f007c18] sysfs_remove_group at ffffffff8117ac27
#10 [ffff88021f007c48] netdev_queue_update_kobjects at ffffffff813178f9
#11 [ffff88021f007c88] netif_set_real_num_tx_queues at ffffffff81303e38
#12 [ffff88021f007cc8] ixgbe_set_num_queues at ffffffffa0249763 [ixgbe]
#13 [ffff88021f007cf8] ixgbe_init_interrupt_scheme at ffffffffa024ea89 [ixgbe]
#14 [ffff88021f007d48] ixgbe_fcoe_disable at ffffffffa0267113 [ixgbe]
#15 [ffff88021f007d68] vlan_dev_fcoe_disable at ffffffffa014fef5 [8021q]
#16 [ffff88021f007d78] fcoe_interface_cleanup at ffffffffa02b7dfd [fcoe]
#17 [ffff88021f007df8] fcoe_destroy_work at ffffffffa02b7f08 [fcoe]
#18 [ffff88021f007e18] process_one_work at ffffffff8105d7ca
#19 [ffff88021f007e68] worker_thread at ffffffff81060513
#20 [ffff88021f007ee8] kthread at ffffffff810648b6
#21 [ffff88021f007f48] kernel_thread_helper at ffffffff813c40f4

Signed-off-by: Yi Zou <yi.zou@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

andatche pushed a commit to andatche/raspi-linux that referenced this issue May 29, 2012

mm: compaction: check pfn_valid when entering a new MAX_ORDER_NR_PAGE…
…S block during isolation for migration

commit 0bf380b upstream.

When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned.  Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash.  This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 #1 [d72d3b24] oops_end at c05c5322
 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 #3 [d72d3bec] bad_area at c0227fb6
 #4 [d72d3c00] do_page_fault at c05c72ec
 #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
    DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
    CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
 #6 [d72d3cb4] isolate_migratepages at c030b15a
 #7 [d72d3d14] zone_watermark_ok at c02d26cb
 #8 [d72d3d2c] compact_zone at c030b8de
 #9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff00  EBX: 00000001  ECX: 00001600  EDX: 00000431
    DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
    SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
    CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned.  Lets say we have a case
like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole.  When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit to popcornmix/linux that referenced this issue Aug 16, 2012

nfs: skip commit in releasepage if we're freeing memory for fs-relate…
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Stonie commented Sep 6, 2012

This is not working for me, please see: #105

popcornmix pushed a commit that referenced this issue Oct 13, 2012

x86: Remove some noise from boot log when starting cpus
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>

popcornmix pushed a commit that referenced this issue Oct 13, 2012

ath9k: Cleanup MCI init/deinit routines
This patch simplifies the buffer allocation functions
for MCI and removes unneeded memset calls. Also, a couple
of unused variables are removed and a memory leak in DMA
allocation is fixed.

[ 1263.788267] WARNING: at /home/sujith/dev/wireless-testing/lib/dma-debug.c:875 check_unmap+0x173/0x7e0()
[ 1263.788273] ath9k 0000:06:00.0: DMA-API: device driver frees DMA memory with different size
               [device address=0x0000000071908000] [map size=512 bytes] [unmap size=256 bytes]
[ 1263.788345] Pid: 774, comm: rmmod Tainted: G        W  O 3.3.0-rc3-wl #18
[ 1263.788348] Call Trace:
[ 1263.788355]  [<ffffffff8105110f>] warn_slowpath_common+0x7f/0xc0
[ 1263.788359]  [<ffffffff81051206>] warn_slowpath_fmt+0x46/0x50
[ 1263.788363]  [<ffffffff8125a713>] check_unmap+0x173/0x7e0
[ 1263.788368]  [<ffffffff8123fc22>] ? prio_tree_left+0x32/0xc0
[ 1263.788373]  [<ffffffff8125aded>] debug_dma_free_coherent+0x6d/0x80
[ 1263.788381]  [<ffffffffa0701c3c>] ath_mci_cleanup+0x7c/0x110 [ath9k]
[ 1263.788387]  [<ffffffffa06f4033>] ath9k_deinit_softc+0x113/0x120 [ath9k]
[ 1263.788392]  [<ffffffffa06f55cc>] ath9k_deinit_device+0x5c/0x70 [ath9k]
[ 1263.788397]  [<ffffffffa0704934>] ath_pci_remove+0x54/0xa0 [ath9k]
[ 1263.788401]  [<ffffffff81267d06>] pci_device_remove+0x46/0x110
[ 1263.788406]  [<ffffffff813102bc>] __device_release_driver+0x7c/0xe0
[ 1263.788410]  [<ffffffff81310a00>] driver_detach+0xd0/0xe0
[ 1263.788414]  [<ffffffff81310118>] bus_remove_driver+0x88/0xe0
[ 1263.788418]  [<ffffffff813111c2>] driver_unregister+0x62/0xa0
[ 1263.788421]  [<ffffffff812680c4>] pci_unregister_driver+0x44/0xc0
[ 1263.788427]  [<ffffffffa0705015>] ath_pci_exit+0x15/0x20 [ath9k]
[ 1263.788432]  [<ffffffffa070a92d>] ath9k_exit+0x15/0x31 [ath9k]
[ 1263.788436]  [<ffffffff810b971c>] sys_delete_module+0x18c/0x270
[ 1263.788441]  [<ffffffff81436edd>] ? retint_swapgs+0x13/0x1b
[ 1263.788446]  [<ffffffff812483be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 1263.788450]  [<ffffffff814378e9>] system_call_fastpath+0x16/0x1b
[ 1263.788453] ---[ end trace 3ab4d030ffde40d4 ]---

Signed-off-by: Sujith Manoharan <c_manoha@qca.qualcomm.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>

popcornmix pushed a commit that referenced this issue Oct 13, 2012

x86/smp: Fix topology checks on AMD MCM CPUs
The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.

However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.

Which makes this check wrong on the above topologies so circumvent it.

[    0.444413] Booting Node   0, Processors  #1 #2 #3 #4 #5 Ok.
[    0.461388] ------------[ cut here ]------------
[    0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[    0.473960] Hardware name: Dinar
[    0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[    0.486860] Booting Node   1, Processors  #6
[    0.491104] Modules linked in:
[    0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[    0.499510] Call Trace:
[    0.501946]  [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[    0.508185]  [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[    0.514163]  [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[    0.519881]  [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[    0.525943]  [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[    0.532004]  [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[    0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[    0.628197]  #7 #8 #9 #10 #11 Ok.
[    0.807108] Booting Node   3, Processors  #12 #13 #14 #15 #16 #17 Ok.
[    0.897587] Booting Node   2, Processors  #18 #19 #20 #21 #22 #23 Ok.
[    0.917443] Brought up 24 CPUs

We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120529135442.GE29157@aftab.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

popcornmix pushed a commit that referenced this issue Oct 13, 2012

nfs: skip commit in releasepage if we're freeing memory for fs-relate…
…d reasons

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org

popcornmix pushed a commit that referenced this issue Mar 28, 2013

cifs: ensure that cifs_get_root() only traverses directories
commit ce2ac52 upstream.

Kjell Braden reported this oops:

[  833.211970] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  833.212816] IP: [<          (null)>]           (null)
[  833.213280] PGD 1b9b2067 PUD e9f7067 PMD 0
[  833.213874] Oops: 0010 [#1] SMP
[  833.214344] CPU 0
[  833.214458] Modules linked in: des_generic md4 nls_utf8 cifs vboxvideo drm snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq bnep rfcomm snd_timer bluetooth snd_seq_device ppdev snd vboxguest parport_pc joydev mac_hid soundcore snd_page_alloc psmouse i2c_piix4 serio_raw lp parport usbhid hid e1000
[  833.215629]
[  833.215629] Pid: 1752, comm: mount.cifs Not tainted 3.0.0-rc7-bisectcifs-fec11dd9a0+ #18 innotek GmbH VirtualBox/VirtualBox
[  833.215629] RIP: 0010:[<0000000000000000>]  [<          (null)>]           (null)
[  833.215629] RSP: 0018:ffff8800119c9c50  EFLAGS: 00010282
[  833.215629] RAX: ffffffffa02186c0 RBX: ffff88000c427780 RCX: 0000000000000000
[  833.215629] RDX: 0000000000000000 RSI: ffff88000c427780 RDI: ffff88000c4362e8
[  833.215629] RBP: ffff8800119c9c88 R08: ffff88001fc15e30 R09: 00000000d69515c7
[  833.215629] R10: ffffffffa0201972 R11: ffff88000e8f6a28 R12: ffff88000c4362e8
[  833.215629] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88001181aaa6
[  833.215629] FS:  00007f2986171700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
[  833.215629] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  833.215629] CR2: 0000000000000000 CR3: 000000001b982000 CR4: 00000000000006f0
[  833.215629] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  833.215629] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  833.215629] Process mount.cifs (pid: 1752, threadinfo ffff8800119c8000, task ffff88001c1c16f0)
[  833.215629] Stack:
[  833.215629]  ffffffff8116a9b5 ffff8800119c9c88 ffffffff81178075 0000000000000286
[  833.215629]  0000000000000000 ffff88000c4276c0 ffff8800119c9ce8 ffff8800119c9cc8
[  833.215629]  ffffffff8116b06e ffff88001bc6fc00 ffff88000c4276c0 ffff88000c4276c0
[  833.215629] Call Trace:
[  833.215629]  [<ffffffff8116a9b5>] ? d_alloc_and_lookup+0x45/0x90
[  833.215629]  [<ffffffff81178075>] ? d_lookup+0x35/0x60
[  833.215629]  [<ffffffff8116b06e>] __lookup_hash.part.14+0x9e/0xc0
[  833.215629]  [<ffffffff8116b1d6>] lookup_one_len+0x146/0x1e0
[  833.215629]  [<ffffffff815e4f7e>] ? _raw_spin_lock+0xe/0x20
[  833.215629]  [<ffffffffa01eef0d>] cifs_do_mount+0x26d/0x500 [cifs]
[  833.215629]  [<ffffffff81163bd3>] mount_fs+0x43/0x1b0
[  833.215629]  [<ffffffff8117d41a>] vfs_kern_mount+0x6a/0xd0
[  833.215629]  [<ffffffff8117e584>] do_kern_mount+0x54/0x110
[  833.215629]  [<ffffffff8117fdc2>] do_mount+0x262/0x840
[  833.215629]  [<ffffffff81108a0e>] ? __get_free_pages+0xe/0x50
[  833.215629]  [<ffffffff8117f9ca>] ? copy_mount_options+0x3a/0x180
[  833.215629]  [<ffffffff8118075d>] sys_mount+0x8d/0xe0
[  833.215629]  [<ffffffff815ece82>] system_call_fastpath+0x16/0x1b
[  833.215629] Code:  Bad RIP value.
[  833.215629] RIP  [<          (null)>]           (null)
[  833.215629]  RSP <ffff8800119c9c50>
[  833.215629] CR2: 0000000000000000
[  833.238525] ---[ end trace ec00758b8d44f529 ]---

When walking down the path on the server, it's possible to hit a
symlink. The path walking code assumes that the caller will handle that
situation properly, but cifs_get_root() isn't set up for it. This patch
prevents the oops by simply returning an error.

A better solution would be to try and chase the symlinks here, but that's
fairly complicated to handle.

Fixes:

    https://bugzilla.kernel.org/show_bug.cgi?id=53221

Reported-and-tested-by: Kjell Braden <afflux@pentabarf.de>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Aug 13, 2013

nfsd: nfsd_open: when dentry_open returns an error do not propagate a…
…s struct file

commit e4daf1f upstream.

The following call chain:
------------------------------------------------------------
nfs4_get_vfs_file
- nfsd_open
  - dentry_open
    - do_dentry_open
      - __get_file_write_access
        - get_write_access
          - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
------------------------------------------------------------

can result in the following state:
------------------------------------------------------------
struct nfs4_file {
...
  fi_fds = {0xffff880c1fa65c80, 0xffffffffffffffe6, 0x0},
  fi_access = {{
      counter = 0x1
    }, {
      counter = 0x0
    }},
...
------------------------------------------------------------

1) First time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NULL, hence nfsd_open() is called where we get status set to an error
and fp->fi_fds[O_WRONLY] to -ETXTBSY. Thus we do not reach
nfs4_file_get_access() and fi_access[O_WRONLY] is not incremented.

2) Second time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NOT NULL (-ETXTBSY), so nfsd_open() is NOT called, but
nfs4_file_get_access() IS called and fi_access[O_WRONLY] is incremented.
Thus we leave a landmine in the form of the nfs4_file data structure in
an incorrect state.

3) Eventually, when __nfs4_file_put_access() is called it finds
fi_access[O_WRONLY] being non-zero, it decrements it and calls
nfs4_file_put_fd() which tries to fput -ETXTBSY.
------------------------------------------------------------
...
     [exception RIP: fput+0x9]
     RIP: ffffffff81177fa9  RSP: ffff88062e365c90  RFLAGS: 00010282
     RAX: ffff880c2b3d99cc  RBX: ffff880c2b3d9978  RCX: 0000000000000002
     RDX: dead000000100101  RSI: 0000000000000001  RDI: ffffffffffffffe6
     RBP: ffff88062e365c90   R8: ffff88041fe797d8   R9: ffff88062e365d58
     R10: 0000000000000008  R11: 0000000000000000  R12: 0000000000000001
     R13: 0000000000000007  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #9 [ffff88062e365c98] __nfs4_file_put_access at ffffffffa0562334 [nfsd]
 #10 [ffff88062e365cc8] nfs4_file_put_access at ffffffffa05623ab [nfsd]
 #11 [ffff88062e365ce8] free_generic_stateid at ffffffffa056634d [nfsd]
 #12 [ffff88062e365d18] release_open_stateid at ffffffffa0566e4b [nfsd]
 #13 [ffff88062e365d38] nfsd4_close at ffffffffa0567401 [nfsd]
 #14 [ffff88062e365d88] nfsd4_proc_compound at ffffffffa0557f28 [nfsd]
 #15 [ffff88062e365dd8] nfsd_dispatch at ffffffffa054543e [nfsd]
 #16 [ffff88062e365e18] svc_process_common at ffffffffa04ba5a4 [sunrpc]
 #17 [ffff88062e365e98] svc_process at ffffffffa04babe0 [sunrpc]
 #18 [ffff88062e365eb8] nfsd at ffffffffa0545b62 [nfsd]
 #19 [ffff88062e365ee8] kthread at ffffffff81090886
 #20 [ffff88062e365f48] kernel_thread at ffffffff8100c14a
------------------------------------------------------------

Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Sep 4, 2013

nfsd: nfsd_open: when dentry_open returns an error do not propagate a…
…s struct file

The following call chain:
------------------------------------------------------------
nfs4_get_vfs_file
- nfsd_open
  - dentry_open
    - do_dentry_open
      - __get_file_write_access
        - get_write_access
          - return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
------------------------------------------------------------

can result in the following state:
------------------------------------------------------------
struct nfs4_file {
...
  fi_fds = {0xffff880c1fa65c80, 0xffffffffffffffe6, 0x0},
  fi_access = {{
      counter = 0x1
    }, {
      counter = 0x0
    }},
...
------------------------------------------------------------

1) First time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NULL, hence nfsd_open() is called where we get status set to an error
and fp->fi_fds[O_WRONLY] to -ETXTBSY. Thus we do not reach
nfs4_file_get_access() and fi_access[O_WRONLY] is not incremented.

2) Second time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NOT NULL (-ETXTBSY), so nfsd_open() is NOT called, but
nfs4_file_get_access() IS called and fi_access[O_WRONLY] is incremented.
Thus we leave a landmine in the form of the nfs4_file data structure in
an incorrect state.

3) Eventually, when __nfs4_file_put_access() is called it finds
fi_access[O_WRONLY] being non-zero, it decrements it and calls
nfs4_file_put_fd() which tries to fput -ETXTBSY.
------------------------------------------------------------
...
     [exception RIP: fput+0x9]
     RIP: ffffffff81177fa9  RSP: ffff88062e365c90  RFLAGS: 00010282
     RAX: ffff880c2b3d99cc  RBX: ffff880c2b3d9978  RCX: 0000000000000002
     RDX: dead000000100101  RSI: 0000000000000001  RDI: ffffffffffffffe6
     RBP: ffff88062e365c90   R8: ffff88041fe797d8   R9: ffff88062e365d58
     R10: 0000000000000008  R11: 0000000000000000  R12: 0000000000000001
     R13: 0000000000000007  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #9 [ffff88062e365c98] __nfs4_file_put_access at ffffffffa0562334 [nfsd]
 #10 [ffff88062e365cc8] nfs4_file_put_access at ffffffffa05623ab [nfsd]
 #11 [ffff88062e365ce8] free_generic_stateid at ffffffffa056634d [nfsd]
 #12 [ffff88062e365d18] release_open_stateid at ffffffffa0566e4b [nfsd]
 #13 [ffff88062e365d38] nfsd4_close at ffffffffa0567401 [nfsd]
 #14 [ffff88062e365d88] nfsd4_proc_compound at ffffffffa0557f28 [nfsd]
 #15 [ffff88062e365dd8] nfsd_dispatch at ffffffffa054543e [nfsd]
 #16 [ffff88062e365e18] svc_process_common at ffffffffa04ba5a4 [sunrpc]
 #17 [ffff88062e365e98] svc_process at ffffffffa04babe0 [sunrpc]
 #18 [ffff88062e365eb8] nfsd at ffffffffa0545b62 [nfsd]
 #19 [ffff88062e365ee8] kthread at ffffffff81090886
 #20 [ffff88062e365f48] kernel_thread at ffffffff8100c14a
------------------------------------------------------------

Cc: stable@vger.kernel.org
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

popcornmix pushed a commit that referenced this issue Sep 4, 2013

MIPS: oprofile: Fix BUG due to smp_processor_id() in preemptible code.
current_cpu_type() is not preemption-safe.
If CONFIG_PREEMPT is enabled then mipsxx_reg_setup() can be called from preemptible state.
Added get_cpu()/put_cpu() pair to make it preemption-safe.

This was found while testing oprofile with CONFIG_DEBUG_PREEMPT enable.

/usr/zntestsuite # opcontrol --init
/usr/zntestsuite # opcontrol --setup --event=L2_CACHE_ACCESSES:500 --event=L2_CACHE_MISSES:500 --no-vmlinux
/usr/zntestsuite # opcontrol --start
Using 2.6+ OProfile kernel interface.
BUG: using smp_processor_id() in preemptible [00000000] code: oprofiled/1362
caller is mipsxx_reg_setup+0x11c/0x164
CPU: 0 PID: 1362 Comm: oprofiled Not tainted 3.10.4 #18
Stack : 00000006 70757465 00000000 00000000 00000000 00000000 80b173f6 00000037
          80b10000 00000000 80b21614 88f5a220 00000000 00000000 00000000 00000000
          00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
          00000000 00000000 00000000 89c49c00 89c49c2c 80721254 807b7927 8012c1d0
          80b10000 80721254 00000000 00000552 88f5a220 80b1335c 807b78e6 89c49ba8
          ...
Call Trace:
[<801099a4>] show_stack+0x64/0x7c
[<80665520>] dump_stack+0x20/0x2c
[<803a2250>] debug_smp_processor_id+0xe0/0xf0
[<8052df24>] mipsxx_reg_setup+0x11c/0x164
[<8052cd70>] op_mips_setup+0x24/0x4c
[<80529cfc>] oprofile_setup+0x5c/0x12c
[<8052b9f8>] event_buffer_open+0x78/0xf8
[<801c3150>] do_dentry_open.isra.15+0x2b8/0x3b0
[<801c3270>] finish_open+0x28/0x4c
[<801d49b8>] do_last.isra.41+0x2cc/0xd00
[<801d54a0>] path_openat+0xb4/0x4c4
[<801d5c44>] do_filp_open+0x3c/0xac
[<801c4744>] do_sys_open+0x110/0x1f4
[<8010f47c>] stack_done+0x20/0x44

Bug reported and original patch by Jerin Jacob <jerinjacobk@gmail.com>.

Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Jerin Jacob <jerinjacobk@gmail.com>

popcornmix pushed a commit that referenced this issue Nov 4, 2013

x86/smpboot: Fix announce_cpu() to printk() the last "OK" properly
When booting secondary CPUs, announce_cpu() is called to show which cpu has
been brought up. For example:

[    0.402751] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 OK
[    0.525667] smpboot: Booting Node   1, Processors  #6 #7 #8 #9 #10 #11 OK
[    0.755592] smpboot: Booting Node   0, Processors  #12 #13 #14 #15 #16 #17 OK
[    0.890495] smpboot: Booting Node   1, Processors  #18 #19 #20 #21 #22 #23

But the last "OK" is lost, because 'nr_cpu_ids-1' represents the maximum
possible cpu id. It should use the maximum present cpu id in case not all
CPUs booted up.

Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <guohanjun@huawei.com>
Cc: <wangyijing@huawei.com>
Cc: <fenghua.yu@intel.com>
Cc: <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1378378676-18276-1-git-send-email-huawei.libin@huawei.com
[ tweaked the changelog, removed unnecessary line break, tweaked the format to align the fields vertically. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

popcornmix pushed a commit that referenced this issue Nov 4, 2013

mm: memcg: fix test for child groups
When memcg code needs to know whether any given memcg has children, it
uses the cgroup child iteration primitives and returns true/false
depending on whether the iteration loop is executed at least once or
not.

Because a cgroup's list of children is RCU protected, these primitives
require the RCU read-lock to be held, which is not the case for all
memcg callers.  This results in the following splat when e.g.  enabling
hierarchy mode:

  WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
  CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
  Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
  Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

In the memcg case, we only care about children when we are attempting to
modify inheritable attributes interactively.  Racing with deletion could
mean a spurious -EBUSY, no problem.  Racing with addition is handled
just fine as well through the memcg_create_mutex: if the child group is
not on the list after the mutex is acquired, it won't be initialized
from the parent's attributes until after the unlock.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

popcornmix pushed a commit that referenced this issue Nov 30, 2013

block: fix a probe argument to blk_register_region
commit a207f59 upstream.

The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

However, in loop and brd, it returns negative error from ERR_PTR.

This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD 23c677067 PUD 23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c365>] blkdev_open+0x65/0x80
 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
 [<ffffffff8114d214>] finish_open+0x34/0x50
 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
 [<ffffffff8114e559>] SyS_open+0x19/0x20
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
 RSP <ffff88023e47dbd8>
CR2: 00000000000002b4
---[ end trace bb7f32dbf02398dc ]---

The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Nov 30, 2013

block: fix a probe argument to blk_register_region
commit a207f59 upstream.

The probe function is supposed to return NULL on failure (as we can see in
kobj_lookup: kobj = probe(dev, index, data); ... if (kobj) return kobj;

However, in loop and brd, it returns negative error from ERR_PTR.

This causes a crash if we simulate disk allocation failure and run
less -f /dev/loop0 because the negative number is interpreted as a pointer:

BUG: unable to handle kernel NULL pointer dereference at 00000000000002b4
IP: [<ffffffff8118b188>] __blkdev_get+0x28/0x450
PGD 23c677067 PUD 23d6d1067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: loop hpfs nvidia(PO) ip6table_filter ip6_tables uvesafb cfbcopyarea cfbimgblt cfbfillrect fbcon font bitblit fbcon_rotate fbcon_cw fbcon_ud fbcon_ccw softcursor fb fbdev msr ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_state ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc tun ipv6 cpufreq_stats cpufreq_ondemand cpufreq_userspace cpufreq_powersave cpufreq_conservative hid_generic spadfs usbhid hid fuse raid0 snd_usb_audio snd_pcm_oss snd_mixer_oss md_mod snd_pcm snd_timer snd_page_alloc snd_hwdep snd_usbmidi_lib dmi_sysfs snd_rawmidi nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack snd soundcore lm85 hwmon_vid ohci_hcd ehci_pci ehci_hcd serverworks sata_svw libata acpi_cpufreq freq_table mperf ide_core usbcore kvm_amd kvm tg3 i2c_piix4 libphy microcode e100 usb_common ptp skge i2c_core pcspkr k10temp evdev floppy hwmon pps_core mii rtc_cmos button processor unix [last unloaded: nvidia]
CPU: 1 PID: 6831 Comm: less Tainted: P        W  O 3.10.15-devel #18
Hardware name: empty empty/S3992-E, BIOS 'V1.06   ' 06/09/2009
task: ffff880203cc6bc0 ti: ffff88023e47c000 task.ti: ffff88023e47c000
RIP: 0010:[<ffffffff8118b188>]  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
RSP: 0018:ffff88023e47dbd8  EFLAGS: 00010286
RAX: ffffffffffffff74 RBX: ffffffffffffff74 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
RBP: ffff88023e47dc18 R08: 0000000000000002 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff88023f519658
R13: ffffffff8118c300 R14: 0000000000000000 R15: ffff88023f519640
FS:  00007f2070bf7700(0000) GS:ffff880247400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002b4 CR3: 000000023da1d000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 0000000000000002 0000001d00000000 000000003e47dc50 ffff88023f519640
 ffff88043d5bb668 ffffffff8118c300 ffff88023d683550 ffff88023e47de60
 ffff88023e47dc98 ffffffff8118c10d 0000001d81605698 0000000000000292
Call Trace:
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c10d>] blkdev_get+0x1dd/0x370
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8118c300>] ? blkdev_get_by_dev+0x60/0x60
 [<ffffffff8118c365>] blkdev_open+0x65/0x80
 [<ffffffff8114d12e>] do_dentry_open.isra.18+0x23e/0x2f0
 [<ffffffff8114d214>] finish_open+0x34/0x50
 [<ffffffff8115e122>] do_last.isra.62+0x2d2/0xc50
 [<ffffffff8115eb58>] path_openat.isra.63+0xb8/0x4d0
 [<ffffffff81115a8e>] ? might_fault+0x4e/0xa0
 [<ffffffff8115f4f0>] do_filp_open+0x40/0x90
 [<ffffffff813cea6c>] ? _raw_spin_unlock+0x2c/0x50
 [<ffffffff8116db85>] ? __alloc_fd+0xa5/0x1f0
 [<ffffffff8114e45f>] do_sys_open+0xef/0x1d0
 [<ffffffff8114e559>] SyS_open+0x19/0x20
 [<ffffffff813cff16>] system_call_fastpath+0x1a/0x1f
Code: 44 00 00 55 48 89 e5 41 57 49 89 ff 41 56 41 89 d6 41 55 41 54 4c 8d 67 18 53 48 83 ec 18 89 75 cc e9 f2 00 00 00 0f 1f 44 00 00 <48> 8b 80 40 03 00 00 48 89 df 4c 8b 68 58 e8 d5
a4 07 00 44 89
RIP  [<ffffffff8118b188>] __blkdev_get+0x28/0x450
 RSP <ffff88023e47dbd8>
CR2: 00000000000002b4
---[ end trace bb7f32dbf02398dc ]---

The brd change should be backported to stable kernels starting with 2.6.25.
The loop change should be backported to stable kernels starting with 2.6.22.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Aug 4, 2014

pwm: mxs: set pwm_chip can_sleep flag
The .config() calls clk_get_rate() which might sleep, so we need to set
pwm_chip can_sleep flag.  Otherwise, we see the following warning when
using PWM driven heartbeat led.

WARNING: CPU: 0 PID: 0 at kernel/locking/mutex.c:856 mutex_trylock+0x184/0x1a4()
DEBUG_LOCKS_WARN_ON(in_interrupt())
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 3.14.0-rc5 #18
[<c0015420>] (unwind_backtrace) from [<c0012cb0>] (show_stack+0x10/0x14)
[<c0012cb0>] (show_stack) from [<c001daf8>] (warn_slowpath_common+0x6c/0x8c)
[<c001daf8>] (warn_slowpath_common) from [<c001dbac>] (warn_slowpath_fmt+0x30/0x40)
[<c001dbac>] (warn_slowpath_fmt) from [<c045df74>] (mutex_trylock+0x184/0x1a4)
[<c045df74>] (mutex_trylock) from [<c0360950>] (clk_prepare_lock+0xc/0xec)
[<c0360950>] (clk_prepare_lock) from [<c0362020>] (clk_get_rate+0xc/0x68)
[<c0362020>] (clk_get_rate) from [<c028d07c>] (mxs_pwm_config+0x20/0x198)
[<c028d07c>] (mxs_pwm_config) from [<c028bde8>] (pwm_config+0x60/0x70)
[<c028bde8>] (pwm_config) from [<c034b61c>] (__led_pwm_set+0x1c/0x3c)
[<c034b61c>] (__led_pwm_set) from [<c034bc3c>] (led_heartbeat_function+0x70/0x110)
[<c034bc3c>] (led_heartbeat_function) from [<c00292f0>] (call_timer_fn+0x7c/0x164)
[<c00292f0>] (call_timer_fn) from [<c00295c8>] (run_timer_softirq+0x1f0/0x260)
[<c00295c8>] (run_timer_softirq) from [<c002255c>] (__do_softirq+0xc4/0x2f0)
[<c002255c>] (__do_softirq) from [<c0022890>] (irq_exit+0xa4/0x10c)
[<c0022890>] (irq_exit) from [<c0010240>] (handle_IRQ+0x34/0x84)
[<c0010240>] (handle_IRQ) from [<c0013524>] (__irq_svc+0x44/0x54)
[<c0013524>] (__irq_svc) from [<c00107f8>] (arch_cpu_idle+0x40/0x48)
[<c00107f8>] (arch_cpu_idle) from [<c005deb8>] (cpu_startup_entry+0x70/0x198)
[<c005deb8>] (cpu_startup_entry) from [<c060aac8>] (start_kernel+0x2a8/0x2f8)

Reported-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Shawn Guo <shawn.guo@freescale.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Tested-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: Thierry Reding <thierry.reding@gmail.com>

popcornmix pushed a commit that referenced this issue Aug 4, 2014

ocfs2: refcount: take rw_lock in ocfs2_reflink
This patch tries to fix this crash:

 #5 [ffff88003c1cd690] do_invalid_op at ffffffff810166d5
 #6 [ffff88003c1cd730] invalid_op at ffffffff8159b2de
    [exception RIP: ocfs2_direct_IO_get_blocks+359]
    RIP: ffffffffa05dfa27  RSP: ffff88003c1cd7e8  RFLAGS: 00010202
    RAX: 0000000000000000  RBX: ffff88003c1cdaa8  RCX: 0000000000000000
    RDX: 000000000000000c  RSI: ffff880027a95000  RDI: ffff88003c79b540
    RBP: ffff88003c1cd858   R8: 0000000000000000   R9: ffffffff815f6ba0
    R10: 00000000000001c9  R11: 00000000000001c9  R12: ffff88002d271500
    R13: 0000000000000001  R14: 0000000000000000  R15: 0000000000001000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff88003c1cd860] do_direct_IO at ffffffff811cd31b
 #8 [ffff88003c1cd950] direct_IO_iovec at ffffffff811cde9c
 #9 [ffff88003c1cd9b0] do_blockdev_direct_IO at ffffffff811ce764
#10 [ffff88003c1cdb80] __blockdev_direct_IO at ffffffff811ce7cc
#11 [ffff88003c1cdbb0] ocfs2_direct_IO at ffffffffa05df756 [ocfs2]
#12 [ffff88003c1cdbe0] generic_file_direct_write_iter at ffffffff8112f935
#13 [ffff88003c1cdc40] ocfs2_file_write_iter at ffffffffa0600ccc [ocfs2]
#14 [ffff88003c1cdd50] do_aio_write at ffffffff8119126c
#15 [ffff88003c1cddc0] aio_rw_vect_retry at ffffffff811d9bb4
#16 [ffff88003c1cddf0] aio_run_iocb at ffffffff811db880
#17 [ffff88003c1cde30] io_submit_one at ffffffff811dc238
#18 [ffff88003c1cde80] do_io_submit at ffffffff811dc437
#19 [ffff88003c1cdf70] sys_io_submit at ffffffff811dc530
#20 [ffff88003c1cdf80] system_call_fastpath at ffffffff8159a159

It crashes at
        BUG_ON(create && (ext_flags & OCFS2_EXT_REFCOUNTED));
in ocfs2_direct_IO_get_blocks.

ocfs2_direct_IO_get_blocks is expecting the OCFS2_EXT_REFCOUNTED be removed in
ocfs2_prepare_inode_for_write() if it was there. But no cluster lock is taken
during the time before (or inside) ocfs2_prepare_inode_for_write() and after
ocfs2_direct_IO_get_blocks().

It can happen in this case:

Node A(which crashes)				Node B
------------------------                 ---------------------------
ocfs2_file_aio_write
  ocfs2_prepare_inode_for_write
    ocfs2_inode_lock
    ...
    ocfs2_inode_unlock
  #no refcount found
....					ocfs2_reflink
                                          ocfs2_inode_lock
                                          ...
                                          ocfs2_inode_unlock
                                          #now, refcount flag set on extent

                                        ...
                                        flush change to disk

ocfs2_direct_IO_get_blocks
  ocfs2_get_clusters
    #extent map miss
    #buffer_head miss
    read extents from disk
  found refcount flag on extent
  crash..

Fix:
Take rw_lock in ocfs2_reflink path

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

popcornmix pushed a commit that referenced this issue Aug 4, 2014

powerpc: bpf: Fix the broken LD_VLAN_TAG_PRESENT test
We have to return the boolean here if the tag presents
or not, not just ANDing the TCI with the mask which results to:

[  709.412097] test_bpf: #18 LD_VLAN_TAG_PRESENT
[  709.412245] ret 4096 != 1
[  709.412332] ret 4096 != 1
[  709.412333] FAIL (2 times)

Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

davet321 pushed a commit to davet321/rpi-linux that referenced this issue Aug 17, 2015

mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
commit ecf5fc6 upstream.

Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Aug 20, 2015

mm, vmscan: Do not wait for page writeback for GFP_NOFS allocations
Nikolay has reported a hang when a memcg reclaim got stuck with the
following backtrace:

PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
  #0 __schedule at ffffffff815ab152
  #1 schedule at ffffffff815ab76e
  #2 schedule_timeout at ffffffff815ae5e5
  #3 io_schedule_timeout at ffffffff815aad6a
  #4 bit_wait_io at ffffffff815abfc6
  #5 __wait_on_bit at ffffffff815abda5
  #6 wait_on_page_bit at ffffffff8111fd4f
  #7 shrink_page_list at ffffffff81135445
  #8 shrink_inactive_list at ffffffff81135845
  #9 shrink_lruvec at ffffffff81135ead
 #10 shrink_zone at ffffffff811360c3
 #11 shrink_zones at ffffffff81136eff
 #12 do_try_to_free_pages at ffffffff8113712f
 #13 try_to_free_mem_cgroup_pages at ffffffff811372be
 #14 try_charge at ffffffff81189423
 #15 mem_cgroup_try_charge at ffffffff8118c6f5
 #16 __add_to_page_cache_locked at ffffffff8112137d
 #17 add_to_page_cache_lru at ffffffff81121618
 #18 pagecache_get_page at ffffffff8112170b
 #19 grow_dev_page at ffffffff811c8297
 #20 __getblk_slow at ffffffff811c91d6
 #21 __getblk_gfp at ffffffff811c92c1
 #22 ext4_ext_grow_indepth at ffffffff8124565c
 #23 ext4_ext_create_new_leaf at ffffffff81246ca8
 #24 ext4_ext_insert_extent at ffffffff81246f09
 #25 ext4_ext_map_blocks at ffffffff8124a848
 #26 ext4_map_blocks at ffffffff8121a5b7
 #27 mpage_map_one_extent at ffffffff8121b1fa
 #28 mpage_map_and_submit_extent at ffffffff8121f07b
 #29 ext4_writepages at ffffffff8121f6d5
 #30 do_writepages at ffffffff8112c490
 #31 __filemap_fdatawrite_range at ffffffff81120199
 #32 filemap_flush at ffffffff8112041c
 #33 ext4_alloc_da_blocks at ffffffff81219da1
 #34 ext4_rename at ffffffff81229b91
 #35 ext4_rename2 at ffffffff81229e32
 #36 vfs_rename at ffffffff811a08a5
 #37 SYSC_renameat2 at ffffffff811a3ffc
 #38 sys_renameat2 at ffffffff811a408e
 #39 sys_rename at ffffffff8119e51e
 #40 system_call_fastpath at ffffffff815afa89

Dave Chinner has properly pointed out that this is a deadlock in the
reclaim code because ext4 doesn't submit pages which are marked by
PG_writeback right away.

The heuristic was introduced by commit e62e384 ("memcg: prevent OOM
with too many dirty pages") and it was applied only when may_enter_fs
was specified.  The code has been changed by c3b94f4 ("memcg:
further prevent OOM with too many dirty pages") which has removed the
__GFP_FS restriction with a reasoning that we do not get into the fs
code.  But this is not sufficient apparently because the fs doesn't
necessarily submit pages marked PG_writeback for IO right away.

ext4_bio_write_page calls io_submit_add_bh but that doesn't necessarily
submit the bio.  Instead it tries to map more pages into the bio and
mpage_map_one_extent might trigger memcg charge which might end up
waiting on a page which is marked PG_writeback but hasn't been submitted
yet so we would end up waiting for something that never finishes.

Fix this issue by replacing __GFP_IO by may_enter_fs check (for case 2)
before we go to wait on the writeback.  The page fault path, which is
the only path that triggers memcg oom killer since 3.12, shouldn't
require GFP_NOFS and so we shouldn't reintroduce the premature OOM
killer issue which was originally addressed by the heuristic.

As per David Chinner the xfs is doing similar thing since 2.6.15 already
so ext4 is not the only affected filesystem.  Moreover he notes:

: For example: IO completion might require unwritten extent conversion
: which executes filesystem transactions and GFP_NOFS allocations. The
: writeback flag on the pages can not be cleared until unwritten
: extent conversion completes. Hence memory reclaim cannot wait on
: page writeback to complete in GFP_NOFS context because it is not
: safe to do so, memcg reclaim or otherwise.

Cc: stable@vger.kernel.org # 3.9+
[tytso@mit.edu: corrected the control flow]
Fixes: c3b94f4 ("memcg: further prevent OOM with too many dirty pages")
Reported-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

anholt referenced this issue in anholt/linux Dec 22, 2015

xfs: give all workqueues rescuer threads
We're consistently hitting deadlocks here with XFS on recent kernels.
After some digging through the crash files, it looks like everyone in
the system is waiting for XFS to reclaim memory.

Something like this:

PID: 2733434  TASK: ffff8808cd242800  CPU: 19  COMMAND: "java"
 #0 [ffff880019c53588] __schedule at ffffffff818c4df2
 #1 [ffff880019c535d8] schedule at ffffffff818c5517
 #2 [ffff880019c535f8] _xfs_log_force_lsn at ffffffff81316348
 #3 [ffff880019c53688] xfs_log_force_lsn at ffffffff813164fb
 #4 [ffff880019c536b8] xfs_iunpin_wait at ffffffff8130835e
 #5 [ffff880019c53728] xfs_reclaim_inode at ffffffff812fd453
 #6 [ffff880019c53778] xfs_reclaim_inodes_ag at ffffffff812fd8c7
 #7 [ffff880019c53928] xfs_reclaim_inodes_nr at ffffffff812fe433
 #8 [ffff880019c53958] xfs_fs_free_cached_objects at ffffffff8130d3b9
 #9 [ffff880019c53968] super_cache_scan at ffffffff811a6f73
#10 [ffff880019c539c8] shrink_slab at ffffffff811460e6
#11 [ffff880019c53aa8] shrink_zone at ffffffff8114a53f
#12 [ffff880019c53b48] do_try_to_free_pages at ffffffff8114a8ba
#13 [ffff880019c53be8] try_to_free_pages at ffffffff8114ad5a
#14 [ffff880019c53c78] __alloc_pages_nodemask at ffffffff8113e1b8
#15 [ffff880019c53d88] alloc_kmem_pages_node at ffffffff8113e671
#16 [ffff880019c53dd8] copy_process at ffffffff8104f781
#17 [ffff880019c53ec8] do_fork at ffffffff8105129c
#18 [ffff880019c53f38] sys_clone at ffffffff810515b6
#19 [ffff880019c53f48] stub_clone at ffffffff818c8e4d

xfs_log_force_lsn is waiting for logs to get cleaned, which is waiting
for IO, which is waiting for workers to complete the IO which is waiting
for worker threads that don't exist yet:

PID: 2752451  TASK: ffff880bd6bdda00  CPU: 37  COMMAND: "kworker/37:1"
 #0 [ffff8808d20abbb0] __schedule at ffffffff818c4df2
 #1 [ffff8808d20abc00] schedule at ffffffff818c5517
 #2 [ffff8808d20abc20] schedule_timeout at ffffffff818c7c6c
 #3 [ffff8808d20abcc0] wait_for_completion_killable at ffffffff818c6495
 #4 [ffff8808d20abd30] kthread_create_on_node at ffffffff8106ec82
 #5 [ffff8808d20abdf0] create_worker at ffffffff8106752f
 #6 [ffff8808d20abe40] worker_thread at ffffffff810699be
 #7 [ffff8808d20abec0] kthread at ffffffff8106ef59
 #8 [ffff8808d20abf50] ret_from_fork at ffffffff818c8ac8

I think we should be using WQ_MEM_RECLAIM to make sure this thread
pool makes progress when we're not able to allocate new workers.

[dchinner: make all workqueues WQ_MEM_RECLAIM]

Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

popcornmix pushed a commit that referenced this issue Apr 18, 2016

drm/i915: Exit cherryview_irq_handler() after one pass
This effectively reverts

commit 8e5fd59
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Wed Apr 9 13:28:50 2014 +0300

    drm/i915/chv: Make CHV irq handler loop until all interrupts are consumed

as under continuous execlists load we can saturate the IRQ handler,
destablising the tsc clock and triggering the NMI watchdog to declare a hung
CPU.

[  552.756051] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[  552.756080] clocksource:                       'refined-jiffies' wd_now: 10003b480 wd_last: 10003b28c mask: ffffffff
[  552.756091] clocksource:                       'tsc' cs_now: d55d31aa50 cs_last: d17446166c mask: ffffffffffffffff
[  552.756210] clocksource: Switched to clocksource refined-jiffies
[  575.217870] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[  575.217893] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rc7+ #18
[  575.217905] Hardware name:                  /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[  575.217915]  0000000000000000 ffff88027fd05bc0 ffffffff81288c6d 0000000000000000
[  575.217935]  0000000000000001 ffff88027fd05be0 ffffffff810e72d1 0000000000000000
[  575.217951]  ffff88027fd05c80 ffff88027fd05c20 ffffffff81114b60 0000000181015f1e
[  575.217967] Call Trace:
[  575.217973]  <NMI>  [<ffffffff81288c6d>] dump_stack+0x4f/0x72
[  575.217994]  [<ffffffff810e72d1>] watchdog_overflow_callback+0x151/0x160
[  575.218003]  [<ffffffff81114b60>] __perf_event_overflow+0xa0/0x1e0
[  575.218016]  [<ffffffff811154c4>] perf_event_overflow+0x14/0x20
[  575.218028]  [<ffffffff8101d2ca>] intel_pmu_handle_irq+0x1da/0x460
[  575.218042]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218052]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218064]  [<ffffffff81014ae8>] perf_event_nmi_handler+0x28/0x50
[  575.218075]  [<ffffffff81007540>] nmi_handle+0x60/0x130
[  575.218086]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218096]  [<ffffffff810079c0>] do_nmi+0x140/0x470
[  575.218108]  [<ffffffff81559ec7>] end_repeat_nmi+0x1a/0x1e
[  575.218119]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218129]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218139]  [<ffffffff814a8aae>] ? poll_idle+0x3e/0x70
[  575.218148]  <<EOE>>  [<ffffffff814a8353>] cpuidle_enter_state+0xf3/0x2f0
[  575.218164]  [<ffffffff814a8587>] cpuidle_enter+0x17/0x20
[  575.218175]  [<ffffffff810aaa3a>] call_cpuidle+0x2a/0x40
[  575.218185]  [<ffffffff810aade3>] cpu_startup_entry+0x273/0x330
[  575.218196]  [<ffffffff81033a1e>] start_secondary+0x10e/0x130

However, not servicing all available IIR within the handler does hurt the
throughput of pathological nop execbuf by about 20%, with a similar effect
upon the dispatch latency of a series of execbuf.

v2: use do {} while(0) for a smaller patch, and easier to revert again

I have reasonable confidence that we do not miss GT interrupts (as
execlists provides a stress case with a failure mechanism easily
detected by igt), however I have less confidence about all the other
sources of interrupts and worry that may lose a display hotplug
interrupt, for example.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=93467
Testcase: igt/gem_exec_nop/basic # requires NMI watchdog
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Antti Koskipää <antti.koskipaa@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: stable@vger.kernel.org
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/1457946117-6714-1-git-send-email-chris@chris-wilson.co.uk
(cherry picked from commit 579de73)
Signed-off-by: Jani Nikula <jani.nikula@intel.com>

popcornmix pushed a commit that referenced this issue Jul 4, 2016

power_supply: tps65217-charger: Fix NULL deref during property export
This bug leads to:

[    1.906411] Unable to handle kernel NULL pointer dereference at virtual address 0000000c
[    1.914878] pgd = c0004000
[    1.917786] [0000000c] *pgd=00000000
[    1.921536] Internal error: Oops: 5 [#1] SMP ARM
[    1.926357] Modules linked in:
[    1.929556] CPU: 0 PID: 14 Comm: kworker/0:1 Not tainted 4.4.5 #18
[    1.936006] Hardware name: Generic AM33XX (Flattened Device Tree)
[    1.942383] Workqueue: events power_supply_changed_work
[    1.947842] task: de2c41c0 ti: de2c8000 task.ti: de2c8000
[    1.953483] PC is at tps65217_ac_get_property+0x14/0x28
[    1.958937] LR is at tps65217_ac_get_property+0x10/0x28

Driver was trying to use drv_data in property get handler. However drv_data
was not set, so it caused NULL pointer dereference. This patch properly
sets drv_data during probe by power_supply_config parameter, so the
property get handler works as desired.

Signed-off-by: Marcin Niestroj <m.niestroj@grinn-global.com>
Fixes: 3636859 ("power_supply: Add support for tps65217-charger")
Signed-off-by: Sebastian Reichel <sre@kernel.org>

popcornmix pushed a commit that referenced this issue Jul 25, 2016

PKCS#7: Fix panic when referring to the empty AKID when DEBUG defined
This fix resolves the following kernel panic if an empty or missing
AuthorityKeyIdentifier is encountered and DEBUG is defined in
pkcs7_verify.c.

[  459.041989] PKEY: <==public_key_verify_signature() = 0
[  459.041993] PKCS7: Verified signature 1
[  459.041995] PKCS7: ==> pkcs7_verify_sig_chain()
[  459.041999] PKCS7: verify Sample DB Certificate for SCP: 01
[  459.042002] PKCS7: - issuer Sample KEK Certificate for SCP
[  459.042014] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  459.042135] IP: [<ffffffff813e7b4c>] pkcs7_verify+0x72c/0x7f0
[  459.042217] PGD 739e6067 PUD 77719067 PMD 0
[  459.042286] Oops: 0000 [#1] PREEMPT SMP
[  459.042328] Modules linked in:
[  459.042368] CPU: 0 PID: 474 Comm: kexec Not tainted 4.7.0-rc7-WR8.0.0.0_standard+ #18
[  459.042462] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Aptio CRB, BIOS 5.6.5 10/09/2014
[  459.042586] task: ffff880073a50000 ti: ffff8800738e8000 task.ti: ffff8800738e8000
[  459.042675] RIP: 0010:[<ffffffff813e7b4c>]  [<ffffffff813e7b4c>] pkcs7_verify+0x72c/0x7f0
[  459.042784] RSP: 0018:ffff8800738ebd58  EFLAGS: 00010246
[  459.042845] RAX: 0000000000000000 RBX: ffff880076b7da80 RCX: 0000000000000006
[  459.042929] RDX: 0000000000000001 RSI: ffffffff81c85001 RDI: ffffffff81ca00a9
[  459.043014] RBP: ffff8800738ebd98 R08: 0000000000000400 R09: ffff8800788a304c
[  459.043098] R10: 0000000000000000 R11: 00000000000060ca R12: ffff8800769a2bc0
[  459.043182] R13: ffff880077358300 R14: 0000000000000000 R15: ffff8800769a2dc0
[  459.043268] FS:  00007f24cc741700(0000) GS:ffff880074e00000(0000) knlGS:0000000000000000
[  459.043365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  459.043431] CR2: 0000000000000000 CR3: 0000000073a36000 CR4: 00000000001006f0
[  459.043514] Stack:
[  459.043530]  0000000000000000 ffffffbf00000020 31ffffff813e68b0 0000000000000002
[  459.043644]  ffff8800769a2bc0 0000000000000000 00000000007197b8 0000000000000002
[  459.043756]  ffff8800738ebdd8 ffffffff81153fb1 0000000000000000 0000000000000000
[  459.043869] Call Trace:
[  459.043898]  [<ffffffff81153fb1>] verify_pkcs7_signature+0x61/0x140
[  459.043974]  [<ffffffff813e7f0b>] verify_pefile_signature+0x2cb/0x830
[  459.044052]  [<ffffffff813e8470>] ? verify_pefile_signature+0x830/0x830
[  459.044134]  [<ffffffff81048e25>] bzImage64_verify_sig+0x15/0x20
[  459.046332]  [<ffffffff81046e09>] arch_kexec_kernel_verify_sig+0x29/0x40
[  459.048552]  [<ffffffff810f10e4>] SyS_kexec_file_load+0x1f4/0x6c0
[  459.050768]  [<ffffffff81050e36>] ? __do_page_fault+0x1b6/0x550
[  459.052996]  [<ffffffff8199241f>] entry_SYSCALL_64_fastpath+0x17/0x93
[  459.055242] Code: e8 0a d6 ff ff 85 c0 0f 88 7a fb ff ff 4d 39 fd 4d 89 7d 08 74 45 4d 89 fd e9 14 fe ff ff 4d 8b 76 08 31 c0 48 c7 c7 a9 00 ca 81 <41> 0f b7 36 49 8d 56 02 e8 d0 91 d6 ff 4d 8b 3c 24 4d 85 ff 0f
[  459.060535] RIP  [<ffffffff813e7b4c>] pkcs7_verify+0x72c/0x7f0
[  459.063040]  RSP <ffff8800738ebd58>
[  459.065456] CR2: 0000000000000000
[  459.075998] ---[ end trace c15f0e897cda28dc ]---

Signed-off-by: Lans Zhang <jia.zhang@windriver.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
cc: linux-crypto@vger.kernel.org
cc: kexec@lists.infradead.org
Signed-off-by: James Morris <james.l.morris@oracle.com>

davet321 pushed a commit to davet321/rpi-linux that referenced this issue Oct 28, 2016

IB/core: correctly handle rdma_rw_init_mrs() failure
commit b6bc1c7 upstream.

Function ib_create_qp() was failing to return an error when
rdma_rw_init_mrs() fails, causing a crash further down in ib_create_qp()
when trying to dereferece the qp pointer which was actually a negative
errno.

The crash:

crash> log|grep BUG
[  136.458121] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
crash> bt
PID: 3736   TASK: ffff8808543215c0  CPU: 2   COMMAND: "kworker/u64:2"
 #0 [ffff88084d323340] machine_kexec at ffffffff8105fbb0
 #1 [ffff88084d3233b0] __crash_kexec at ffffffff81116758
 #2 [ffff88084d323480] crash_kexec at ffffffff8111682d
 #3 [ffff88084d3234b0] oops_end at ffffffff81032bd6
 #4 [ffff88084d3234e0] no_context at ffffffff8106e431
 #5 [ffff88084d323530] __bad_area_nosemaphore at ffffffff8106e610
 #6 [ffff88084d323590] bad_area_nosemaphore at ffffffff8106e6f4
 #7 [ffff88084d3235a0] __do_page_fault at ffffffff8106ebdc
 #8 [ffff88084d323620] do_page_fault at ffffffff8106f057
 #9 [ffff88084d323660] page_fault at ffffffff816e3148
    [exception RIP: ib_create_qp+427]
    RIP: ffffffffa02554fb  RSP: ffff88084d323718  RFLAGS: 00010246
    RAX: 0000000000000004  RBX: fffffffffffffff4  RCX: 000000018020001f
    RDX: ffff880830997fc0  RSI: 0000000000000001  RDI: ffff88085f407200
    RBP: ffff88084d323778   R8: 0000000000000001   R9: ffffea0020bae210
    R10: ffffea0020bae218  R11: 0000000000000001  R12: ffff88084d3237c8
    R13: 00000000fffffff4  R14: ffff880859fa5000  R15: ffff88082eb89800
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff88084d323780] rdma_create_qp at ffffffffa0782681 [rdma_cm]
#11 [ffff88084d3237b0] nvmet_rdma_create_queue_ib at ffffffffa07c43f3 [nvmet_rdma]
#12 [ffff88084d323860] nvmet_rdma_alloc_queue at ffffffffa07c5ba9 [nvmet_rdma]
#13 [ffff88084d323900] nvmet_rdma_queue_connect at ffffffffa07c5c96 [nvmet_rdma]
#14 [ffff88084d323980] nvmet_rdma_cm_handler at ffffffffa07c6450 [nvmet_rdma]
#15 [ffff88084d3239b0] iw_conn_req_handler at ffffffffa0787480 [rdma_cm]
#16 [ffff88084d323a60] cm_conn_req_handler at ffffffffa0775f06 [iw_cm]
#17 [ffff88084d323ab0] process_event at ffffffffa0776019 [iw_cm]
#18 [ffff88084d323af0] cm_work_handler at ffffffffa0776170 [iw_cm]
#19 [ffff88084d323cb0] process_one_work at ffffffff810a1483
#20 [ffff88084d323d90] worker_thread at ffffffff810a211d
#21 [ffff88084d323ec0] kthread at ffffffff810a6c5c
#22 [ffff88084d323f50] ret_from_fork at ffffffff816e1ebf

Fixes: 632bc3f ("IB/core, RDMA RW API: Do not exceed QP SGE send limit")
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Jan 7, 2017

xfs: fix up xfs_swap_extent_forks inline extent handling
commit 4dfce57 upstream.

There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Mar 21, 2017

dccp/tcp: fix routing redirect race
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ED6E0F17 pushed a commit to ED6E0F17/linux that referenced this issue Mar 22, 2017

dccp/tcp: fix routing redirect race
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Mar 23, 2017

dccp/tcp: fix routing redirect race
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix added a commit that referenced this issue Mar 23, 2017

dccp/tcp: fix routing redirect race
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

dlech pushed a commit to ev3dev/rpi-kernel that referenced this issue Apr 14, 2017

dccp/tcp: fix routing redirect race
[ Upstream commit 45caeaa ]

As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.

We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:

 #8 [] page_fault at ffffffff8163e648
    [exception RIP: __tcp_ack_snd_check+74]
.
.
 #9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8

Of course it may happen with other NIC drivers as well.

It's found the freed dst_entry here:

 224 static bool tcp_in_quickack_mode(struct sock *sk)↩
 225 {↩
 226 ▹       const struct inet_connection_sock *icsk = inet_csk(sk);↩
 227 ▹       const struct dst_entry *dst = __sk_dst_get(sk);↩
 228 ↩
 229 ▹       return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
 230 ▹       ▹       (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
 231 }↩

But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.

All the vmcores showed 2 significant clues:

- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.

- All vmcores showed a postitive LockDroppedIcmps value, e.g:

LockDroppedIcmps                  267

A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:

do_redirect()->__sk_dst_check()-> dst_release().

Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.

To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.

The dccp/IPv6 code is very similar in this respect, so fixing it there too.

As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().

Fixes: ceb3320 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

dlech pushed a commit to ev3dev/rpi-kernel that referenced this issue Apr 14, 2017

xfs: fix up xfs_swap_extent_forks inline extent handling
commit 4dfce57 upstream.

There have been several reports over the years of NULL pointer
dereferences in xfs_trans_log_inode during xfs_fsr processes,
when the process is doing an fput and tearing down extents
on the temporary inode, something like:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
    [exception RIP: xfs_trans_log_inode+0x10]
 #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
#10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
#11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
#12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
#13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
#14 [ffff8800a57bbe00] evict at ffffffff811e1b67
#15 [ffff8800a57bbe28] iput at ffffffff811e23a5
#16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
#17 [ffff8800a57bbe88] dput at ffffffff811dd06c
#18 [ffff8800a57bbea8] __fput at ffffffff811c823b
#19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
#20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
#21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
#22 [ffff8800a57bbf50] int_signal at ffffffff8161405d

As it turns out, this is because the i_itemp pointer, along
with the d_ops pointer, has been overwritten with zeros
when we tear down the extents during truncate.  When the in-core
inode fork on the temporary inode used by xfs_fsr was originally
set up during the extent swap, we mistakenly looked at di_nextents
to determine whether all extents fit inline, but this misses extents
generated by speculative preallocation; we should be using if_bytes
instead.

This mistake corrupts the in-memory inode, and code in
xfs_iext_remove_inline eventually gets bad inputs, causing
it to memmove and memset incorrect ranges; this became apparent
because the two values in ifp->if_u2.if_inline_ext[1] contained
what should have been in d_ops and i_itemp; they were memmoved due
to incorrect array indexing and then the original locations
were zeroed with memset, again due to an array overrun.

Fix this by properly using i_df.if_bytes to determine the number
of extents, not di_nextents.

Thanks to dchinner for looking at this with me and spotting the
root cause.

[nborisov: backported to 4.4]

Cc: stable@vger.kernel.org
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
--
 fs/xfs/xfs_bmap_util.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

popcornmix pushed a commit that referenced this issue Aug 8, 2017

mailbox: pcc: Fix crash when request PCC channel 0
When PCCT is not available, kernel crashes as below when requests PCC
channel 0. This patch fixes this issue.

[    0.920454] PCCT header not found.
...
[    8.031309] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[    8.031310] [0000000000000010] user address but active_mm is swapper
[    8.031312] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[    8.031313] Modules linked in:
[    8.031316] CPU: 31 PID: 1 Comm: swapper/0 Tainted: G        W       4.13.0-rc1 #18
[    8.031317] Hardware name: AppliedMicro(R) 07/20/2017
[    8.031318] task: ffff809ef3b08000 task.stack: ffff809ef3b10000
[    8.031322] PC is at pcc_mbox_request_channel+0x8c/0x160
[    8.031325] LR is at xgene_slimpro_i2c_probe+0x1c0/0x378
[    8.031326] pc : [<ffff000008899450>] lr : [<ffff000008819dac>] pstate: 00000045
[    8.031327] sp : ffff809ef3b13bd0
[    8.031327] x29: ffff809ef3b13bd0 x28: ffff000008ed90a0
[    8.031329] x27: ffff000009091000 x26: ffff000008e50470
[    8.031330] x25: ffff000008ed9100 x24: ffff809eefd9ac30
[    8.031332] x23: 0000000000000000 x22: ffff0000090e3e10
[    8.031333] x21: ffff0000090e3000 x20: 0000000000000000
[    8.031335] x19: 0000000000000000 x18: 0000000000087ffc
[    8.031336] x17: 2fe48d76a78303f0 x16: 0000000000087ffc
[    8.031337] x15: ffff000000000000 x14: 0000000000000000
[    8.031339] x13: 0000000000000000 x12: 0000000000000018
[    8.031340] x11: 0000000000000018 x10: 0101010101010101
[    8.031342] x9 : 0000000000000000 x8 : 7f7f7f7f7f7f7f7f
[    8.031343] x7 : fefefefeff6b646d x6 : 0000008080808080
[    8.031345] x5 : 0000000000000000 x4 : 0000000000000001
[    8.031346] x3 : 0000000000000000 x2 : ffff000008819b64
[    8.031348] x1 : 0000000000000000 x0 : 0000000000000000
...
[    8.031393] Call trace:
[    8.031394] Exception stack(0xffff809ef3b13a00 to 0xffff809ef3b13b30)
[    8.031395] 3a00: 0000000000000000 0001000000000000 ffff809ef3b13bd0 ffff000008899450
[    8.031397] 3a20: ffff809f7e1f9a10 ffff000008f60be0 0000000000000001 ffff809ef3b13b7c
[    8.031398] 3a40: ffff809f7e1f9a10 0000000000000000 ffff000009091000 0000000000000003
[    8.031399] 3a60: ffff000009091000 0000000000000003 ffff809ef3b13a80 ffff0000084e0794
[    8.031400] 3a80: ffff809ef3b13a90 ffff00000850bb64 ffff809ef3b13ad0 ffff00000850bf34
[    8.031402] 3aa0: 0000000000000000 0000000000000000 ffff000008819b64 0000000000000000
[    8.031403] 3ac0: 0000000000000001 0000000000000000 0000008080808080 fefefefeff6b646d
[    8.031404] 3ae0: 7f7f7f7f7f7f7f7f 0000000000000000 0101010101010101 0000000000000018
[    8.031405] 3b00: 0000000000000018 0000000000000000 0000000000000000 ffff000000000000
[    8.031406] 3b20: 0000000000087ffc 2fe48d76a78303f0
[    8.031409] [<ffff000008899450>] pcc_mbox_request_channel+0x8c/0x160
[    8.031410] [<ffff000008819dac>] xgene_slimpro_i2c_probe+0x1c0/0x378
[    8.031413] [<ffff0000085e84dc>] platform_drv_probe+0x50/0xbc
[    8.031414] [<ffff0000085e68a4>] driver_probe_device+0x21c/0x2d0
[    8.031416] [<ffff0000085e6a04>] __driver_attach+0xac/0xb0
[    8.031417] [<ffff0000085e4a78>] bus_for_each_dev+0x58/0x98
[    8.031418] [<ffff0000085e61e4>] driver_attach+0x20/0x28
[    8.031419] [<ffff0000085e5e0c>] bus_add_driver+0x1c8/0x22c
[    8.031421] [<ffff0000085e7324>] driver_register+0x60/0xf4
[    8.031422] [<ffff0000085e8420>] __platform_driver_register+0x4c/0x54
[    8.031425] [<ffff000008e96dd0>] xgene_slimpro_i2c_driver_init+0x18/0x20
[    8.031426] [<ffff000008083144>] do_one_initcall+0x38/0x124
[    8.031429] [<ffff000008e50d0c>] kernel_init_freeable+0x190/0x22c
[    8.031431] [<ffff0000089eac30>] kernel_init+0x10/0xfc
[    8.031432] [<ffff000008082ec0>] ret_from_fork+0x10/0x50
[    8.031434] Code: cb030e63 8b030013 b140067f 54fffda8 (f9400a61)
[    8.031448] ---[ end trace 14eb48a4e1e1f9fb ]---

Signed-off-by: Hoan Tran <hotran@apm.com>
Acked-by: Prashanth Prakash <pprakash@codeaurora.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

popcornmix pushed a commit that referenced this issue Dec 29, 2017

KVM: Fix stack-out-of-bounds read in write_mmio
Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

popcornmix pushed a commit that referenced this issue Jan 17, 2018

KVM: Fix stack-out-of-bounds read in write_mmio
commit e39d200 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

popcornmix pushed a commit that referenced this issue Jan 17, 2018

KVM: Fix stack-out-of-bounds read in write_mmio
commit e39d200 upstream.

Reported by syzkaller:

  BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
  Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298

  CPU: 6 PID: 32298 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #18
  Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
  Call Trace:
   dump_stack+0xab/0xe1
   print_address_description+0x6b/0x290
   kasan_report+0x28a/0x370
   write_mmio+0x11e/0x270 [kvm]
   emulator_read_write_onepage+0x311/0x600 [kvm]
   emulator_read_write+0xef/0x240 [kvm]
   emulator_fix_hypercall+0x105/0x150 [kvm]
   em_hypercall+0x2b/0x80 [kvm]
   x86_emulate_insn+0x2b1/0x1640 [kvm]
   x86_emulate_instruction+0x39a/0xb90 [kvm]
   handle_exception+0x1b4/0x4d0 [kvm_intel]
   vcpu_enter_guest+0x15a0/0x2640 [kvm]
   kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
   kvm_vcpu_ioctl+0x479/0x880 [kvm]
   do_vfs_ioctl+0x142/0x9a0
   SyS_ioctl+0x74/0x80
   entry_SYSCALL_64_fastpath+0x23/0x9a

The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741).  This patch fixes
it by just accessing the bytes which we operate on.

Before patch:

syz-executor-5567  [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f

After patch:

syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment