Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bootloader offset #16

Closed
nomis opened this issue May 7, 2012 · 23 comments
Closed

Bootloader offset #16

nomis opened this issue May 7, 2012 · 23 comments

Comments

@nomis
Copy link
Contributor

nomis commented May 7, 2012

Could the bootloader be changed to load kernel.img at 0x8000 and provide the instructions and data for the first 32K itself?

Alternatively it could load a separate file for the first 32K and then kernel.bin at 0x8000?

Then there's no need to put extra data on the start of the kernel image every time.

@popcornmix
Copy link
Collaborator

popcornmix commented May 8, 2012

What's the motivation for this? Saving 32K of disk space? Making build process simpler? of is there a more significant benefit?
I'm a little wary of just changing the convention, as it makes it harder to switch kernels/firmware from before/after the event.
I could add a new config.txt setting to switch between the two behaviours.
I could see if md5sum of first 32K of kernel matches the normal version, and if not, then load it at 0x8000 and fill in the normal version's first32k. Any preference?

@nomis
Copy link
Contributor Author

nomis commented May 8, 2012

The motivation is to make the build process simpler as it looks unlikely that the current process would ever get integrated with the kernel build process as the kernel shouldn't be including a fixed machine ID. It's also awkward even as a memory image that another bootloader then partially overwrites to provide arguments.

No one can use the first 32KB anyway because of the kernel arguments being written at 0x100.

I'm not sure what the best option is. I think the current boot process may loop if loaded at 0x8000.

@popcornmix
Copy link
Collaborator

popcornmix commented May 8, 2012

The latest github start.elf supports config.txt option:
disable_commandline_tags=1
which avoids writing kernel arguments.

How about:
if (disable_commandline_tags=1) load kernel.img at 0. Don't fill in arguments.
else if (md5sum first32k matches) load kernel.img at 0 and fill in aguments (*)
else load kernel.img at 0x8000, GPU inserts first32K, and fills in aguments

Case (*) is deprecated and removed at some point in future.

@nomis
Copy link
Contributor Author

nomis commented May 8, 2012

That sounds reasonable but you'll need to warn people that it's happening as I know haiku is already using a "b 0x8000" as the only opcode in the first 32KB.

bootc pushed a commit to bootc/linux-rpi-orig that referenced this issue May 8, 2012
…S block during isolation for migration

commit 0bf380b upstream.

When isolating for migration, migration starts at the start of a zone
which is not necessarily pageblock aligned.  Further, it stops isolating
when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally
not aligned.  This allows isolate_migratepages() to call pfn_to_page() on
an invalid PFN which can result in a crash.  This was originally reported
against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902   TASK: d47aecd0  CPU: 0   COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 raspberrypi#1 [d72d3b24] oops_end at c05c5322
 raspberrypi#2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 raspberrypi#3 [d72d3bec] bad_area at c0227fb6
 raspberrypi#4 [d72d3c00] do_page_fault at c05c72ec
 raspberrypi#5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000  EBX: 000c0000  ECX: 00000001  EDX: 00000807  EBP: 000c0000
    DS:  007b      ESI: 00000001  ES:  007b      EDI: f3000a80  GS:  6f50
    CS:  0060      EIP: c030b15a  ERR: ffffffff  EFLAGS: 00010002
 raspberrypi#6 [d72d3cb4] isolate_migratepages at c030b15a
 raspberrypi#7 [d72d3d14] zone_watermark_ok at c02d26cb
 raspberrypi#8 [d72d3d2c] compact_zone at c030b8de
 raspberrypi#9 [d72d3d68] compact_zone_order at c030bba1
raspberrypi#10 [d72d3db4] try_to_compact_pages at c030bc84
raspberrypi#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
raspberrypi#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
raspberrypi#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
raspberrypi#14 [d72d3eb8] alloc_pages_vma at c030a845
raspberrypi#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
raspberrypi#16 [d72d3f00] handle_mm_fault at c02f36c6
raspberrypi#17 [d72d3f30] do_page_fault at c05c70ed
raspberrypi#18 [d72d3fb0] error_code (via page_fault) at c05c47a4
    EAX: b71ff00  EBX: 00000001  ECX: 00001600  EDX: 00000431
    DS:  007b      ESI: 08048950  ES:  007b      EDI: bfaa3788
    SS:  007b      ESP: bfaa36e0  EBP: bfaa3828  GS:  6f50
    CS:  0073      EIP: 080487c8  ERR: ffffffff  EFLAGS: 00010202

It was also reported by Herbert van den Bergh against 3.1-based kernel
with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being
checked and that PFN is not necessarily aligned.  Lets say we have a case
like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond
the hole.  When isolate_migratepages started, it scans from migrate_pfn to
migrate_pfn+pageblock_nr_pages which is now in a memory hole.  It checks
pfn_valid() on the first PFN but then scans into the hole where there are
not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when
necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
@nomis
Copy link
Contributor Author

nomis commented May 8, 2012

It looks like proper devicetree support makes things even more complicated because the bootloader is supposed to provide that information at 0x100, which implies that the ability to load a 32KB section from a separate file would be useful.

How you're going to get the extra parameters (serial number, etc.) to the kernel in that scenario I don't know...

@nomis
Copy link
Contributor Author

nomis commented May 8, 2012

I think you could load a device tree blob from a file to 0x100 and append the ATAGs.

@bootc
Copy link
Contributor

bootc commented May 9, 2012

If I understand this correctly then, the bootloader could have a device-tree mode. If there is a 'bcm2708.dtb' file in the boot partition, its contents is loaded at 0x100 instead of the ATAGS, and it could also assume that it should load the kernel at 0x8000 and put its own code at 0x0000 as necessary to setup r0/r1/r2 for kernel entry.

That way we kill several birds with one stone: device tree booting (with a .dtb file we can edit as things evolve) as well as removing the need for a custom kernel image, and the whole thing can be managed as a single transition without needing to specify special configuration options in config.txt.

Note that I don't know enough about device-tree yet to know exactly how the bootloader should insert kernel arguments into the device tree file.

@nomis
Copy link
Contributor Author

nomis commented May 9, 2012

I'd suggest the .dtb file is named something other than bcm2708.dtb as there would need to be different files for different device models (e.g. rpi-model-b.dtb) so a generic filename, something like "kernel.dtb" would be less confusing than the hardware model.

@popcornmix
Copy link
Collaborator

popcornmix commented May 9, 2012

I don't know a lot about device tree, but if you tell me what is neeed I can arrange for a dtb file to be loaded at 0x100.
Do you want existing ATAGS (e.g. cmdline.txt) to be added after the dtb file ?

@bootc
Copy link
Contributor

bootc commented May 9, 2012

For full device tree support, we'd need the bootloader to load in the .dtb file and then edit the structure to add such things as memory size and command-line arguments added in. Device-tree should completely replace the ATAGS support.

@nomis
Copy link
Contributor Author

nomis commented May 13, 2012

It should also provide the system serial number and the MAC address.

There isn't currently any code in Linux to get the serial number from device tree even though it's a generic part of the arm arch...

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

It would also be natural for the display size/depth to be provided in the device tree instead of as module parameters.

@popcornmix
Copy link
Collaborator

popcornmix commented May 14, 2012

If you provide me with a .dtb file, and explain where useful values should be inserted (ideally provide a C code function that parses/modifies the loaded .dtb), then I'll make the GPU do it.

I've implemented the optional first32k scheme described earlier. I'll share that later tonight after some testing. It also has the UART setup.

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

The list of DMA channels should also be provided (although I've not added a suitable device to the .dts file for that yet).

I don't know that much about .dtb files, but you can find the .dts in my rpi-linear branch as arch/arm/boot/dts/bcm2835.dts (this branch will currently boot with uart0 and the display but has no usb or mmc support). [You need to provide the .dtb at 0x100 via r2 or configure it to be appended to the kernel image (and then append it yourself) otherwise it won't boot]

Running "make dtbs" will create arch/arm/boot/bcm2835.dtb if you have the BCM2708 arch configured.

This file in the kernel source is strictly a convenience template, so it would be ok to put in placeholders to aid the bootloader. Everything's be32 so it should be simple to overwrite default values as required. There's already a placeholder for the memory.

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

The memory area to use (i.e. L2 on/off) could also be provided in the device tree although I haven't looked into how the kernel could reconfigure those at runtime.

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

There's libfdt for manipulating the .dtb here: git://git.jdl.com/software/dtc.git (dual GPL/BSD so you'd be able to use it).

@popcornmix
Copy link
Collaborator

popcornmix commented May 14, 2012

So how do you want it loaded?
config.txt options:
device_tree=bcm2835.dtb (possibly that will be a default in future)
device_tree_address=0x100 (again the default)
GPU will then load it at 0x100, and overwrite the fields it knows about.
Does the code at 0x0 set r2? Should I modify that?

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

Yes, the code at 0x0 should set r2 to the device_tree_address

@popcornmix
Copy link
Collaborator

popcornmix commented May 14, 2012

Okay, a test firmware:
http://dl.dropbox.com/u/3669512/start.elf

This should support:
CONFIG_STR(device_tree)
CONFIG(device_tree_address)
CONFIG(kernel_address)
CONFIG(disable_commandline_tags)

It also sets up the UART to 115200 with a 3MHz clock as requested elsewhere. You can modify these:
CONFIG(init_uart_baud)
CONFIG(init_uart_clock)

So, as described start32k is now optional (and should be removed from kernel.img when using device tree)
You can load a device_tree with
device_tree=bcm2835.dtb
device_tree_address=0x100
r2 should be set to this address is initial code.
I don't (yet) modify the device tree

I did try parsing the device tree from GPU and it looked okay (http://pastebin.com/ty66Jies)

@nomis
Copy link
Contributor Author

nomis commented May 14, 2012

This works for me!

[ 0.000000] Machine: BCM2708, model: BCM2835
[ 0.000000] Memory: 192MB = 192MB total
[ 0.000000] bank0: VIC at 0x2000b200 (8 IRQs)
[ 0.000000] bank1: VIC at 0x2000b204 (32 IRQs)
[ 0.000000] bank2: VIC at 0x2000b208 (32 IRQs)
[ 0.000000] sched_clock: 32 bits at 1000kHz, resolution 1000ns, wraps every 4294967ms
[ 0.000000] timer1: timer at MMIO 0x20003010 (irq = 1)
[ 0.000000] timer3: timer at MMIO 0x20003018 (irq = 3)
[ 0.112351] Serial: AMBA PL011 UART driver
[ 0.113869] 20201000.uart0: ttyAMA0 at MMIO 0x20201000 (irq = 57) is a PL011 rev3
[ 0.696930] console [ttyAMA0] enabled
[ 0.703394] bcm-mbox: Broadcom VideoCore Mailbox driver
[ 0.709829] bcm-mbox 2000b880.mbox0: mailbox at MMIO 0x2000b880 (irq = 65, channels = 0x000b)
[ 0.719777] stc: 32-bit clock at MMIO 0x20003004, 1000000 Hz
[ 0.754826] Switching to clocksource stc
[ 1.964569] bcm2708_fb display.0: registering framebuffer (800x480@16)
[ 1.985666] Console: switching to colour frame buffer device 100x30
[ 2.001894] bcm2708_wdog: default timeout=16 (nowayout=0)
[ 2.012992] bcm2708_wdog 20100000.watchdog: at MMIO 0x20100000
[ 2.124425] Freeing init memory: 10304K

At that point it waits forever for mmc0 as I was using cmdline.txt instead of the default :)

@kallisti5
Copy link

kallisti5 commented May 15, 2012

aaah... just when we get Haiku uart working :) Thanks for all the hard work on this guys.

@popcornmix
Copy link
Collaborator

popcornmix commented May 16, 2012

Firmware now pushed to github. Please close issue if happy.

@nomis
Copy link
Contributor Author

nomis commented May 17, 2012

The bootloader is able to load the kernel at 0x8000 and the device tree file at 0x100 so I'll close this issue and raise another one specifically on the device tree population.

@nomis nomis closed this as completed May 17, 2012
popcornmix pushed a commit to popcornmix/linux that referenced this issue Aug 16, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     raspberrypi#1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     raspberrypi#2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     raspberrypi#3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     raspberrypi#4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     raspberrypi#5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     raspberrypi#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     raspberrypi#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     raspberrypi#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     raspberrypi#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    raspberrypi#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    raspberrypi#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    raspberrypi#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    raspberrypi#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    raspberrypi#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    raspberrypi#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    raspberrypi#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    raspberrypi#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    raspberrypi#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    raspberrypi#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    raspberrypi#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    raspberrypi#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    raspberrypi#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    raspberrypi#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    raspberrypi#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    raspberrypi#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
popcornmix pushed a commit that referenced this issue Oct 13, 2012
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 #31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 #6 #7 Ok.
Booting Node   1, Processors  #8 #9 #10 #11 #12 #13 #14 #15 Ok.
Booting Node   0, Processors  #16 #17 #18 #19 #20 #21 #22 #23 Ok.
Booting Node   1, Processors  #24 #25 #26 #27 #28 #29 #30 #31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
popcornmix pushed a commit that referenced this issue Oct 13, 2012
Add a parameter to avoid using MSI/MSI-X for PCIe native hotplug; it's
known to be buggy on some platforms.

In my environment, while shutting down, following stack trace is shown
sometimes.

  irq 16: nobody cared (try booting with the "irqpoll" option)
  Pid: 1081, comm: reboot Not tainted 3.2.0 #1
  Call Trace:
   <IRQ>  [<ffffffff810cec1d>] __report_bad_irq+0x3d/0xe0
   [<ffffffff810cee1c>] note_interrupt+0x15c/0x210
   [<ffffffff810cc485>] handle_irq_event_percpu+0xb5/0x210
   [<ffffffff810cc621>] handle_irq_event+0x41/0x70
   [<ffffffff810cf675>] handle_fasteoi_irq+0x55/0xc0
   [<ffffffff81015356>] handle_irq+0x46/0xb0
   [<ffffffff814fbe9d>] do_IRQ+0x5d/0xe0
   [<ffffffff814f146e>] common_interrupt+0x6e/0x6e
   [<ffffffff8106b040>] ? __do_softirq+0x60/0x210
   [<ffffffff8108aeb1>] ? hrtimer_interrupt+0x151/0x240
   [<ffffffff814fb5ec>] call_softirq+0x1c/0x30
   [<ffffffff810152d5>] do_softirq+0x65/0xa0
   [<ffffffff8106ae9d>] irq_exit+0xbd/0xe0
   [<ffffffff814fbf8e>] smp_apic_timer_interrupt+0x6e/0x99
   [<ffffffff814f9e5e>] apic_timer_interrupt+0x6e/0x80
   <EOI>  [<ffffffff814f0fb1>] ? _raw_spin_unlock_irqrestore+0x11/0x20
   [<ffffffff812629fc>] pci_bus_write_config_word+0x6c/0x80
   [<ffffffff81266fc2>] pci_intx+0x52/0xa0
   [<ffffffff8127de3d>] pci_intx_for_msi+0x1d/0x30
  [<ffffffff8127e4fb>] pci_msi_shutdown+0x7b/0x110
   [<ffffffff81269d34>] pci_device_shutdown+0x34/0x50
   [<ffffffff81326c4f>] device_shutdown+0x2f/0x140
   [<ffffffff8107b981>] kernel_restart_prepare+0x31/0x40
   [<ffffffff8107b9e6>] kernel_restart+0x16/0x60
   [<ffffffff8107bbfd>] sys_reboot+0x1ad/0x220
   [<ffffffff814f4b90>] ? do_page_fault+0x1e0/0x460
   [<ffffffff811942d0>] ? __sync_filesystem+0x90/0x90
   [<ffffffff8105c9aa>] ? __cond_resched+0x2a/0x40
   [<ffffffff814ef090>] ? _cond_resched+0x30/0x40
   [<ffffffff81169e17>] ? iterate_supers+0xb7/0xd0
   [<ffffffff814f9382>] system_call_fastpath+0x16/0x1b
  handlers:
  [<ffffffff8138a0f0>] usb_hcd_irq
  [<ffffffff8138a0f0>] usb_hcd_irq
  [<ffffffff8138a0f0>] usb_hcd_irq
  Disabling IRQ #16

An un-wanted interrupt is generated when PCI driver switches from
MSI/MSI-X to INTx while shutting down the device.  The interrupt does
not happen if MSI/MSI-X is not used on the device.
I confirmed that this problem does not happen if pcie_hp=nomsi was
specified and hotplug operation worked fine as usual.

v2: Automatically disable MSI/MSI-X against following device:
    PCI bridge: Integrated Device Technology, Inc. Device 807f (rev 02)
v3: Based on the review comment, combile the if statements.
v4: Removed module parameter.
    Move some code to build pciehp as a module.
    Move device specific code to driver/pci/quirks.c.
v5: Drop a device specific code until getting a vendor statement.

Reviewed-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
Signed-off-by: MUNEDA Takahiro <muneda.takahiro@jp.fujitsu.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
popcornmix pushed a commit that referenced this issue Oct 13, 2012
Fixes a lockdep warning:

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
net/mac80211/agg-rx.c:148 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 1
1 lock held by arecord/11226:
 #0:  (&tid_agg_rx->session_timer){+.-...}, at: [<ffffffff81066bb0>] call_timer_fn+0x0/0x360

stack backtrace:
Pid: 11226, comm: arecord Not tainted 3.1.0-kml #16
Call Trace:
 <IRQ>  [<ffffffff81093454>] lockdep_rcu_dereference+0xa4/0xc0
 [<ffffffffa02778c9>] sta_rx_agg_session_timer_expired+0xc9/0x110 [mac80211]
 [<ffffffffa0277800>] ? ieee80211_process_addba_resp+0x220/0x220 [mac80211]
 [<ffffffff81066c3a>] call_timer_fn+0x8a/0x360
 [<ffffffff81066bb0>] ? init_timer_deferrable_key+0x30/0x30
 [<ffffffff81477bb0>] ? _raw_spin_unlock_irq+0x30/0x70
 [<ffffffff81067049>] run_timer_softirq+0x139/0x310
 [<ffffffff81091d5e>] ? put_lock_stats.isra.25+0xe/0x40
 [<ffffffff810922ac>] ? lock_release_holdtime.part.26+0xdc/0x160
 [<ffffffffa0277800>] ? ieee80211_process_addba_resp+0x220/0x220 [mac80211]
 [<ffffffff8105cb78>] __do_softirq+0xc8/0x3c0
 [<ffffffff8108f088>] ? tick_dev_program_event+0x48/0x110
 [<ffffffff8108f16f>] ? tick_program_event+0x1f/0x30
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8147a43c>] call_softirq+0x1c/0x30
 [<ffffffff81004c55>] do_softirq+0xa5/0xe0
 [<ffffffff8105d1ee>] irq_exit+0xae/0xe0
 [<ffffffff8147ac6b>] smp_apic_timer_interrupt+0x6b/0x98
 [<ffffffff81479ab3>] apic_timer_interrupt+0x73/0x80
 <EOI>  [<ffffffff8146aac6>] ? free_debug_processing+0x1a1/0x1d5
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8146ab2b>] __slab_free+0x31/0x2ca
 [<ffffffff81477c3a>] ? _raw_spin_unlock_irqrestore+0x4a/0x90
 [<ffffffff81253b8f>] ? __debug_check_no_obj_freed+0x15f/0x210
 [<ffffffff81097054>] ? lock_release_nested+0x84/0xc0
 [<ffffffff8113ec55>] ? kmem_cache_free+0x105/0x250
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff81153b15>] ? putname+0x35/0x50
 [<ffffffff8113ed8f>] kmem_cache_free+0x23f/0x250
 [<ffffffff81153b15>] putname+0x35/0x50
 [<ffffffff81146d8d>] do_sys_open+0x16d/0x1d0
 [<ffffffff81146e10>] sys_open+0x20/0x30
 [<ffffffff81478f42>] system_call_fastpath+0x16/0x1b

Reported-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
popcornmix pushed a commit that referenced this issue Oct 22, 2021
commit 19ea40d upstream.

[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

  BTRFS info (device loop0): enabling ssd optimizations
  FAULT_INJECTION: forcing a failure.
  name failslab, interval 1, probability 0, space 0, times 0
  CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
   fail_dump lib/fault-inject.c:52 [inline]
   should_fail+0x13c/0x160 lib/fault-inject.c:146
   should_failslab+0x5/0x10 mm/slab_common.c:1328
   slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
   slab_alloc_node mm/slub.c:3120 [inline]
   slab_alloc mm/slub.c:3214 [inline]
   kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
   btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
   btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
   __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
   btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
   btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
   btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
   lookup_open+0x660/0x780 fs/namei.c:3282
   open_last_lookups fs/namei.c:3352 [inline]
   path_openat+0x465/0xe20 fs/namei.c:3557
   do_filp_open+0xe3/0x170 fs/namei.c:3588
   do_sys_openat2+0x357/0x4a0 fs/open.c:1200
   do_sys_open+0x87/0xd0 fs/open.c:1216
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x46ae99
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
  89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
  01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
  RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
  RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
  R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

  ================================================
  WARNING: lock held when returning to user space!
  5.15.0-rc1 #16 Not tainted
  ------------------------------------------------
  syz-executor/7579 is leaving the kernel with locks still held!
  1 lock held by syz-executor/7579:
   #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
  __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Oct 25, 2021
commit 19ea40d upstream.

[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

  BTRFS info (device loop0): enabling ssd optimizations
  FAULT_INJECTION: forcing a failure.
  name failslab, interval 1, probability 0, space 0, times 0
  CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
   fail_dump lib/fault-inject.c:52 [inline]
   should_fail+0x13c/0x160 lib/fault-inject.c:146
   should_failslab+0x5/0x10 mm/slab_common.c:1328
   slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
   slab_alloc_node mm/slub.c:3120 [inline]
   slab_alloc mm/slub.c:3214 [inline]
   kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
   btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
   btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
   __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
   btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
   btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
   btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
   lookup_open+0x660/0x780 fs/namei.c:3282
   open_last_lookups fs/namei.c:3352 [inline]
   path_openat+0x465/0xe20 fs/namei.c:3557
   do_filp_open+0xe3/0x170 fs/namei.c:3588
   do_sys_openat2+0x357/0x4a0 fs/open.c:1200
   do_sys_open+0x87/0xd0 fs/open.c:1216
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x46ae99
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
  89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
  01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
  RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
  RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
  R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

  ================================================
  WARNING: lock held when returning to user space!
  5.15.0-rc1 #16 Not tainted
  ------------------------------------------------
  syz-executor/7579 is leaving the kernel with locks still held!
  1 lock held by syz-executor/7579:
   #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
  __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Nov 25, 2021
[ Upstream commit 54659ca ]

when turning off a connection, lockdep complains with the
following warning (a modprobe has been done but the same
happens with a disconnection from NetworkManager,
it's enough to trigger a cfg80211_disconnect call):

[  682.855867] ======================================================
[  682.855877] WARNING: possible circular locking dependency detected
[  682.855887] 5.14.0-rc6+ #16 Tainted: G         C OE
[  682.855898] ------------------------------------------------------
[  682.855906] modprobe/1770 is trying to acquire lock:
[  682.855916] ffffb6d000332b00 (&pxmitpriv->lock){+.-.}-{2:2},
		at: rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.856073]
               but task is already holding lock:
[  682.856081] ffffb6d0003336a8 (&pstapriv->sta_hash_lock){+.-.}-{2:2},
		at: rtw_free_assoc_resources+0x48/0x110 [r8723bs]
[  682.856207]
               which lock already depends on the new lock.

[  682.856215]
               the existing dependency chain (in reverse order) is:
[  682.856223]
               -> #1 (&pstapriv->sta_hash_lock){+.-.}-{2:2}:
[  682.856247]        _raw_spin_lock_bh+0x34/0x40
[  682.856265]        rtw_get_stainfo+0x9a/0x110 [r8723bs]
[  682.856389]        rtw_xmit_classifier+0x27/0x130 [r8723bs]
[  682.856515]        rtw_xmitframe_enqueue+0xa/0x20 [r8723bs]
[  682.856642]        rtl8723bs_hal_xmit+0x3b/0xb0 [r8723bs]
[  682.856752]        rtw_xmit+0x4ef/0x890 [r8723bs]
[  682.856879]        _rtw_xmit_entry+0xba/0x350 [r8723bs]
[  682.856981]        dev_hard_start_xmit+0xee/0x320
[  682.856999]        sch_direct_xmit+0x8c/0x330
[  682.857014]        __dev_queue_xmit+0xba5/0xf00
[  682.857030]        packet_sendmsg+0x981/0x1b80
[  682.857047]        sock_sendmsg+0x5b/0x60
[  682.857060]        __sys_sendto+0xf1/0x160
[  682.857073]        __x64_sys_sendto+0x24/0x30
[  682.857087]        do_syscall_64+0x3a/0x80
[  682.857102]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.857117]
               -> #0 (&pxmitpriv->lock){+.-.}-{2:2}:
[  682.857142]        __lock_acquire+0xfd9/0x1b50
[  682.857158]        lock_acquire+0xb4/0x2c0
[  682.857172]        _raw_spin_lock_bh+0x34/0x40
[  682.857185]        rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.857308]        rtw_free_assoc_resources+0x53/0x110 [r8723bs]
[  682.857415]        cfg80211_rtw_disconnect+0x4b/0x70 [r8723bs]
[  682.857522]        cfg80211_disconnect+0x12e/0x2f0 [cfg80211]
[  682.857759]        cfg80211_leave+0x2b/0x40 [cfg80211]
[  682.857961]        cfg80211_netdev_notifier_call+0xa9/0x560 [cfg80211]
[  682.858163]        raw_notifier_call_chain+0x41/0x50
[  682.858180]        __dev_close_many+0x62/0x100
[  682.858195]        dev_close_many+0x7d/0x120
[  682.858209]        unregister_netdevice_many+0x416/0x680
[  682.858225]        unregister_netdevice_queue+0xab/0xf0
[  682.858240]        unregister_netdev+0x18/0x20
[  682.858255]        rtw_unregister_netdevs+0x28/0x40 [r8723bs]
[  682.858360]        rtw_dev_remove+0x24/0xd0 [r8723bs]
[  682.858463]        sdio_bus_remove+0x31/0xd0 [mmc_core]
[  682.858532]        device_release_driver_internal+0xf7/0x1d0
[  682.858550]        driver_detach+0x47/0x90
[  682.858564]        bus_remove_driver+0x77/0xd0
[  682.858579]        rtw_drv_halt+0xc/0x678 [r8723bs]
[  682.858685]        __x64_sys_delete_module+0x13f/0x250
[  682.858699]        do_syscall_64+0x3a/0x80
[  682.858715]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.858729]
               other info that might help us debug this:

[  682.858737]  Possible unsafe locking scenario:

[  682.858744]        CPU0                    CPU1
[  682.858751]        ----                    ----
[  682.858758]   lock(&pstapriv->sta_hash_lock);
[  682.858772]                                lock(&pxmitpriv->lock);
[  682.858786]                                lock(&pstapriv->sta_hash_lock);
[  682.858799]   lock(&pxmitpriv->lock);
[  682.858812]
                *** DEADLOCK ***

[  682.858820] 5 locks held by modprobe/1770:
[  682.858831]  #0: ffff8d870697d980 (&dev->mutex){....}-{3:3},
		at: device_release_driver_internal+0x1a/0x1d0
[  682.858869]  #1: ffffffffbdbbf1c8 (rtnl_mutex){+.+.}-{3:3},
		at: unregister_netdev+0xe/0x20
[  682.858906]  #2: ffff8d87054ee5e8 (&rdev->wiphy.mtx){+.+.}-{3:3},
		at: cfg80211_netdev_notifier_call+0x9e/0x560 [cfg80211]
[  682.859131]  #3: ffff8d870f2bc8f0 (&wdev->mtx){+.+.}-{3:3},
		at: cfg80211_leave+0x20/0x40 [cfg80211]
[  682.859354]  #4: ffffb6d0003336a8 (&pstapriv->sta_hash_lock){+.-.}-{2:2},
		at: rtw_free_assoc_resources+0x48/0x110 [r8723bs]
[  682.859482]
               stack backtrace:
[  682.859491] CPU: 1 PID: 1770 Comm: modprobe Tainted: G
		C OE     5.14.0-rc6+ #16
[  682.859507] Hardware name: LENOVO 80NR/Madrid, BIOS DACN25WW 08/20/2015
[  682.859517] Call Trace:
[  682.859531]  dump_stack_lvl+0x56/0x6f
[  682.859551]  check_noncircular+0xdb/0xf0
[  682.859579]  __lock_acquire+0xfd9/0x1b50
[  682.859606]  lock_acquire+0xb4/0x2c0
[  682.859623]  ? rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.859752]  ? mark_held_locks+0x48/0x70
[  682.859769]  ? rtw_free_stainfo+0x4a/0x4a0 [r8723bs]
[  682.859898]  _raw_spin_lock_bh+0x34/0x40
[  682.859914]  ? rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.860039]  rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.860171]  rtw_free_assoc_resources+0x53/0x110 [r8723bs]
[  682.860286]  cfg80211_rtw_disconnect+0x4b/0x70 [r8723bs]
[  682.860397]  cfg80211_disconnect+0x12e/0x2f0 [cfg80211]
[  682.860629]  cfg80211_leave+0x2b/0x40 [cfg80211]
[  682.860836]  cfg80211_netdev_notifier_call+0xa9/0x560 [cfg80211]
[  682.861048]  ? __lock_acquire+0x4dc/0x1b50
[  682.861070]  ? lock_is_held_type+0xa8/0x110
[  682.861089]  ? lock_is_held_type+0xa8/0x110
[  682.861104]  ? find_held_lock+0x2d/0x90
[  682.861120]  ? packet_notifier+0x173/0x300
[  682.861141]  ? lock_release+0xb3/0x250
[  682.861160]  ? packet_notifier+0x192/0x300
[  682.861184]  raw_notifier_call_chain+0x41/0x50
[  682.861205]  __dev_close_many+0x62/0x100
[  682.861224]  dev_close_many+0x7d/0x120
[  682.861245]  unregister_netdevice_many+0x416/0x680
[  682.861264]  ? find_held_lock+0x2d/0x90
[  682.861284]  unregister_netdevice_queue+0xab/0xf0
[  682.861306]  unregister_netdev+0x18/0x20
[  682.861325]  rtw_unregister_netdevs+0x28/0x40 [r8723bs]
[  682.861434]  rtw_dev_remove+0x24/0xd0 [r8723bs]
[  682.861542]  sdio_bus_remove+0x31/0xd0 [mmc_core]
[  682.861615]  device_release_driver_internal+0xf7/0x1d0
[  682.861637]  driver_detach+0x47/0x90
[  682.861656]  bus_remove_driver+0x77/0xd0
[  682.861674]  rtw_drv_halt+0xc/0x678 [r8723bs]
[  682.861782]  __x64_sys_delete_module+0x13f/0x250
[  682.861801]  ? lockdep_hardirqs_on_prepare+0xf3/0x170
[  682.861817]  ? syscall_enter_from_user_mode+0x20/0x70
[  682.861836]  do_syscall_64+0x3a/0x80
[  682.861855]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.861873] RIP: 0033:0x7f6dbe85400b
[  682.861890] Code: 73 01 c3 48 8b 0d 6d 1e 0c 00 f7 d8 64 89
01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d
1e 0c 00 f7 d8 64 89 01 48
[  682.861906] RSP: 002b:00007ffe7a82f538 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  682.861923] RAX: ffffffffffffffda RBX: 000055a64693bd20 RCX: 00007f6dbe85400b
[  682.861935] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a64693bd88
[  682.861946] RBP: 000055a64693bd20 R08: 0000000000000000 R09: 0000000000000000
[  682.861957] R10: 00007f6dbe8c7ac0 R11: 0000000000000206 R12: 000055a64693bd88
[  682.861967] R13: 0000000000000000 R14: 000055a64693bd88 R15: 00007ffe7a831848

This happens because when we enqueue a frame for
transmission we do it under xmit_priv lock, then calling
rtw_get_stainfo (needed for enqueuing) takes sta_hash_lock
and this leads to the following lock dependency:

xmit_priv->lock -> sta_hash_lock

Turning off a connection will bring to call
rtw_free_assoc_resources which will set up
the inverse dependency:

sta_hash_lock -> xmit_priv_lock

This could lead to a deadlock as lockdep complains.

Fix it by removing the xmit_priv->lock around
rtw_xmitframe_enqueue call inside rtl8723bs_hal_xmit
and put it in a smaller critical section inside
rtw_xmit_classifier, the only place where
xmit_priv data are actually accessed.

Replace spin_{lock,unlock}_bh(pxmitpriv->lock)
in other tx paths leading to rtw_xmitframe_enqueue
call with spin_{lock,unlock}_bh(psta->sleep_q.lock)
- it's not clear why accessing a sleep_q was protected
by a spinlock on xmitpriv->lock.

This way is avoided the same faulty lock nesting
order.

Extra changes in v2 by Hans de Goede:
-Lift the taking of the struct __queue.lock spinlock out of
 rtw_free_xmitframe_queue() into the callers this allows also
 protecting a bunch of related state in rtw_free_stainfo():
-Protect psta->sleepq_len on rtw_free_xmitframe_queue(&psta->sleep_q);
-Protect struct tx_servq.tx_pending and tx_servq.qcnt when
 calling rtw_free_xmitframe_queue(&tx_servq.sta_pending)
-This also allows moving the spin_lock_bh(&pxmitpriv->lock); to below
 the sleep_q free-ing code, avoiding another ABBA locking issue

CC: Larry Finger <Larry.Finger@lwfinger.net>
Co-developed-by: Hans de Goede <hdegoede@redhat.com>
Tested-on: Lenovo Ideapad MiiX 300-10IBY
Signed-off-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210920145502.155454-1-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Nov 29, 2021
[ Upstream commit 54659ca ]

when turning off a connection, lockdep complains with the
following warning (a modprobe has been done but the same
happens with a disconnection from NetworkManager,
it's enough to trigger a cfg80211_disconnect call):

[  682.855867] ======================================================
[  682.855877] WARNING: possible circular locking dependency detected
[  682.855887] 5.14.0-rc6+ #16 Tainted: G         C OE
[  682.855898] ------------------------------------------------------
[  682.855906] modprobe/1770 is trying to acquire lock:
[  682.855916] ffffb6d000332b00 (&pxmitpriv->lock){+.-.}-{2:2},
		at: rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.856073]
               but task is already holding lock:
[  682.856081] ffffb6d0003336a8 (&pstapriv->sta_hash_lock){+.-.}-{2:2},
		at: rtw_free_assoc_resources+0x48/0x110 [r8723bs]
[  682.856207]
               which lock already depends on the new lock.

[  682.856215]
               the existing dependency chain (in reverse order) is:
[  682.856223]
               -> #1 (&pstapriv->sta_hash_lock){+.-.}-{2:2}:
[  682.856247]        _raw_spin_lock_bh+0x34/0x40
[  682.856265]        rtw_get_stainfo+0x9a/0x110 [r8723bs]
[  682.856389]        rtw_xmit_classifier+0x27/0x130 [r8723bs]
[  682.856515]        rtw_xmitframe_enqueue+0xa/0x20 [r8723bs]
[  682.856642]        rtl8723bs_hal_xmit+0x3b/0xb0 [r8723bs]
[  682.856752]        rtw_xmit+0x4ef/0x890 [r8723bs]
[  682.856879]        _rtw_xmit_entry+0xba/0x350 [r8723bs]
[  682.856981]        dev_hard_start_xmit+0xee/0x320
[  682.856999]        sch_direct_xmit+0x8c/0x330
[  682.857014]        __dev_queue_xmit+0xba5/0xf00
[  682.857030]        packet_sendmsg+0x981/0x1b80
[  682.857047]        sock_sendmsg+0x5b/0x60
[  682.857060]        __sys_sendto+0xf1/0x160
[  682.857073]        __x64_sys_sendto+0x24/0x30
[  682.857087]        do_syscall_64+0x3a/0x80
[  682.857102]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.857117]
               -> #0 (&pxmitpriv->lock){+.-.}-{2:2}:
[  682.857142]        __lock_acquire+0xfd9/0x1b50
[  682.857158]        lock_acquire+0xb4/0x2c0
[  682.857172]        _raw_spin_lock_bh+0x34/0x40
[  682.857185]        rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.857308]        rtw_free_assoc_resources+0x53/0x110 [r8723bs]
[  682.857415]        cfg80211_rtw_disconnect+0x4b/0x70 [r8723bs]
[  682.857522]        cfg80211_disconnect+0x12e/0x2f0 [cfg80211]
[  682.857759]        cfg80211_leave+0x2b/0x40 [cfg80211]
[  682.857961]        cfg80211_netdev_notifier_call+0xa9/0x560 [cfg80211]
[  682.858163]        raw_notifier_call_chain+0x41/0x50
[  682.858180]        __dev_close_many+0x62/0x100
[  682.858195]        dev_close_many+0x7d/0x120
[  682.858209]        unregister_netdevice_many+0x416/0x680
[  682.858225]        unregister_netdevice_queue+0xab/0xf0
[  682.858240]        unregister_netdev+0x18/0x20
[  682.858255]        rtw_unregister_netdevs+0x28/0x40 [r8723bs]
[  682.858360]        rtw_dev_remove+0x24/0xd0 [r8723bs]
[  682.858463]        sdio_bus_remove+0x31/0xd0 [mmc_core]
[  682.858532]        device_release_driver_internal+0xf7/0x1d0
[  682.858550]        driver_detach+0x47/0x90
[  682.858564]        bus_remove_driver+0x77/0xd0
[  682.858579]        rtw_drv_halt+0xc/0x678 [r8723bs]
[  682.858685]        __x64_sys_delete_module+0x13f/0x250
[  682.858699]        do_syscall_64+0x3a/0x80
[  682.858715]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.858729]
               other info that might help us debug this:

[  682.858737]  Possible unsafe locking scenario:

[  682.858744]        CPU0                    CPU1
[  682.858751]        ----                    ----
[  682.858758]   lock(&pstapriv->sta_hash_lock);
[  682.858772]                                lock(&pxmitpriv->lock);
[  682.858786]                                lock(&pstapriv->sta_hash_lock);
[  682.858799]   lock(&pxmitpriv->lock);
[  682.858812]
                *** DEADLOCK ***

[  682.858820] 5 locks held by modprobe/1770:
[  682.858831]  #0: ffff8d870697d980 (&dev->mutex){....}-{3:3},
		at: device_release_driver_internal+0x1a/0x1d0
[  682.858869]  #1: ffffffffbdbbf1c8 (rtnl_mutex){+.+.}-{3:3},
		at: unregister_netdev+0xe/0x20
[  682.858906]  #2: ffff8d87054ee5e8 (&rdev->wiphy.mtx){+.+.}-{3:3},
		at: cfg80211_netdev_notifier_call+0x9e/0x560 [cfg80211]
[  682.859131]  #3: ffff8d870f2bc8f0 (&wdev->mtx){+.+.}-{3:3},
		at: cfg80211_leave+0x20/0x40 [cfg80211]
[  682.859354]  #4: ffffb6d0003336a8 (&pstapriv->sta_hash_lock){+.-.}-{2:2},
		at: rtw_free_assoc_resources+0x48/0x110 [r8723bs]
[  682.859482]
               stack backtrace:
[  682.859491] CPU: 1 PID: 1770 Comm: modprobe Tainted: G
		C OE     5.14.0-rc6+ #16
[  682.859507] Hardware name: LENOVO 80NR/Madrid, BIOS DACN25WW 08/20/2015
[  682.859517] Call Trace:
[  682.859531]  dump_stack_lvl+0x56/0x6f
[  682.859551]  check_noncircular+0xdb/0xf0
[  682.859579]  __lock_acquire+0xfd9/0x1b50
[  682.859606]  lock_acquire+0xb4/0x2c0
[  682.859623]  ? rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.859752]  ? mark_held_locks+0x48/0x70
[  682.859769]  ? rtw_free_stainfo+0x4a/0x4a0 [r8723bs]
[  682.859898]  _raw_spin_lock_bh+0x34/0x40
[  682.859914]  ? rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.860039]  rtw_free_stainfo+0x52/0x4a0 [r8723bs]
[  682.860171]  rtw_free_assoc_resources+0x53/0x110 [r8723bs]
[  682.860286]  cfg80211_rtw_disconnect+0x4b/0x70 [r8723bs]
[  682.860397]  cfg80211_disconnect+0x12e/0x2f0 [cfg80211]
[  682.860629]  cfg80211_leave+0x2b/0x40 [cfg80211]
[  682.860836]  cfg80211_netdev_notifier_call+0xa9/0x560 [cfg80211]
[  682.861048]  ? __lock_acquire+0x4dc/0x1b50
[  682.861070]  ? lock_is_held_type+0xa8/0x110
[  682.861089]  ? lock_is_held_type+0xa8/0x110
[  682.861104]  ? find_held_lock+0x2d/0x90
[  682.861120]  ? packet_notifier+0x173/0x300
[  682.861141]  ? lock_release+0xb3/0x250
[  682.861160]  ? packet_notifier+0x192/0x300
[  682.861184]  raw_notifier_call_chain+0x41/0x50
[  682.861205]  __dev_close_many+0x62/0x100
[  682.861224]  dev_close_many+0x7d/0x120
[  682.861245]  unregister_netdevice_many+0x416/0x680
[  682.861264]  ? find_held_lock+0x2d/0x90
[  682.861284]  unregister_netdevice_queue+0xab/0xf0
[  682.861306]  unregister_netdev+0x18/0x20
[  682.861325]  rtw_unregister_netdevs+0x28/0x40 [r8723bs]
[  682.861434]  rtw_dev_remove+0x24/0xd0 [r8723bs]
[  682.861542]  sdio_bus_remove+0x31/0xd0 [mmc_core]
[  682.861615]  device_release_driver_internal+0xf7/0x1d0
[  682.861637]  driver_detach+0x47/0x90
[  682.861656]  bus_remove_driver+0x77/0xd0
[  682.861674]  rtw_drv_halt+0xc/0x678 [r8723bs]
[  682.861782]  __x64_sys_delete_module+0x13f/0x250
[  682.861801]  ? lockdep_hardirqs_on_prepare+0xf3/0x170
[  682.861817]  ? syscall_enter_from_user_mode+0x20/0x70
[  682.861836]  do_syscall_64+0x3a/0x80
[  682.861855]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  682.861873] RIP: 0033:0x7f6dbe85400b
[  682.861890] Code: 73 01 c3 48 8b 0d 6d 1e 0c 00 f7 d8 64 89
01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa
b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3d
1e 0c 00 f7 d8 64 89 01 48
[  682.861906] RSP: 002b:00007ffe7a82f538 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  682.861923] RAX: ffffffffffffffda RBX: 000055a64693bd20 RCX: 00007f6dbe85400b
[  682.861935] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055a64693bd88
[  682.861946] RBP: 000055a64693bd20 R08: 0000000000000000 R09: 0000000000000000
[  682.861957] R10: 00007f6dbe8c7ac0 R11: 0000000000000206 R12: 000055a64693bd88
[  682.861967] R13: 0000000000000000 R14: 000055a64693bd88 R15: 00007ffe7a831848

This happens because when we enqueue a frame for
transmission we do it under xmit_priv lock, then calling
rtw_get_stainfo (needed for enqueuing) takes sta_hash_lock
and this leads to the following lock dependency:

xmit_priv->lock -> sta_hash_lock

Turning off a connection will bring to call
rtw_free_assoc_resources which will set up
the inverse dependency:

sta_hash_lock -> xmit_priv_lock

This could lead to a deadlock as lockdep complains.

Fix it by removing the xmit_priv->lock around
rtw_xmitframe_enqueue call inside rtl8723bs_hal_xmit
and put it in a smaller critical section inside
rtw_xmit_classifier, the only place where
xmit_priv data are actually accessed.

Replace spin_{lock,unlock}_bh(pxmitpriv->lock)
in other tx paths leading to rtw_xmitframe_enqueue
call with spin_{lock,unlock}_bh(psta->sleep_q.lock)
- it's not clear why accessing a sleep_q was protected
by a spinlock on xmitpriv->lock.

This way is avoided the same faulty lock nesting
order.

Extra changes in v2 by Hans de Goede:
-Lift the taking of the struct __queue.lock spinlock out of
 rtw_free_xmitframe_queue() into the callers this allows also
 protecting a bunch of related state in rtw_free_stainfo():
-Protect psta->sleepq_len on rtw_free_xmitframe_queue(&psta->sleep_q);
-Protect struct tx_servq.tx_pending and tx_servq.qcnt when
 calling rtw_free_xmitframe_queue(&tx_servq.sta_pending)
-This also allows moving the spin_lock_bh(&pxmitpriv->lock); to below
 the sleep_q free-ing code, avoiding another ABBA locking issue

CC: Larry Finger <Larry.Finger@lwfinger.net>
Co-developed-by: Hans de Goede <hdegoede@redhat.com>
Tested-on: Lenovo Ideapad MiiX 300-10IBY
Signed-off-by: Fabio Aiuto <fabioaiuto83@gmail.com>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Link: https://lore.kernel.org/r/20210920145502.155454-1-hdegoede@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
nguyenanhgiau pushed a commit to nguyenanhgiau/linux that referenced this issue Jan 5, 2022
commit 57f0ff0 upstream.

It's later supposed to be either a correct address or NULL. Without the
initialization, it may contain an undefined value which results in the
following segmentation fault:

  # perf top --sort comm -g --ignore-callees=do_idle

terminates with:

  #0  0x00007ffff56b7685 in __strlen_avx2 () from /lib64/libc.so.6
  raspberrypi#1  0x00007ffff55e3802 in strdup () from /lib64/libc.so.6
  raspberrypi#2  0x00005555558cb139 in hist_entry__init (callchain_size=<optimized out>, sample_self=true, template=0x7fffde7fb110, he=0x7fffd801c250) at util/hist.c:489
  raspberrypi#3  hist_entry__new (template=template@entry=0x7fffde7fb110, sample_self=sample_self@entry=true) at util/hist.c:564
  raspberrypi#4  0x00005555558cb4ba in hists__findnew_entry (hists=hists@entry=0x5555561d9e38, entry=entry@entry=0x7fffde7fb110, al=al@entry=0x7fffde7fb420,
      sample_self=sample_self@entry=true) at util/hist.c:657
  raspberrypi#5  0x00005555558cba1b in __hists__add_entry (hists=hists@entry=0x5555561d9e38, al=0x7fffde7fb420, sym_parent=<optimized out>, bi=bi@entry=0x0, mi=mi@entry=0x0,
      sample=sample@entry=0x7fffde7fb4b0, sample_self=true, ops=0x0, block_info=0x0) at util/hist.c:288
  raspberrypi#6  0x00005555558cbb70 in hists__add_entry (sample_self=true, sample=0x7fffde7fb4b0, mi=0x0, bi=0x0, sym_parent=<optimized out>, al=<optimized out>, hists=0x5555561d9e38)
      at util/hist.c:1056
  raspberrypi#7  iter_add_single_cumulative_entry (iter=0x7fffde7fb460, al=<optimized out>) at util/hist.c:1056
  raspberrypi#8  0x00005555558cc8a4 in hist_entry_iter__add (iter=iter@entry=0x7fffde7fb460, al=al@entry=0x7fffde7fb420, max_stack_depth=<optimized out>, arg=arg@entry=0x7fffffff7db0)
      at util/hist.c:1231
  raspberrypi#9  0x00005555557cdc9a in perf_event__process_sample (machine=<optimized out>, sample=0x7fffde7fb4b0, evsel=<optimized out>, event=<optimized out>, tool=0x7fffffff7db0)
      at builtin-top.c:842
  raspberrypi#10 deliver_event (qe=<optimized out>, qevent=<optimized out>) at builtin-top.c:1202
  raspberrypi#11 0x00005555558a9318 in do_flush (show_progress=false, oe=0x7fffffff80e0) at util/ordered-events.c:244
  raspberrypi#12 __ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP, timestamp=timestamp@entry=0) at util/ordered-events.c:323
  raspberrypi#13 0x00005555558a9789 in __ordered_events__flush (timestamp=<optimized out>, how=<optimized out>, oe=<optimized out>) at util/ordered-events.c:339
  raspberrypi#14 ordered_events__flush (how=OE_FLUSH__TOP, oe=0x7fffffff80e0) at util/ordered-events.c:341
  raspberrypi#15 ordered_events__flush (oe=oe@entry=0x7fffffff80e0, how=how@entry=OE_FLUSH__TOP) at util/ordered-events.c:339
  raspberrypi#16 0x00005555557cd631 in process_thread (arg=0x7fffffff7db0) at builtin-top.c:1114
  raspberrypi#17 0x00007ffff7bb817a in start_thread () from /lib64/libpthread.so.0
  raspberrypi#18 0x00007ffff5656dc3 in clone () from /lib64/libc.so.6

If you look at the frame raspberrypi#2, the code is:

488	 if (he->srcline) {
489          he->srcline = strdup(he->srcline);
490          if (he->srcline == NULL)
491              goto err_rawdata;
492	 }

If he->srcline is not NULL (it is not NULL if it is uninitialized rubbish),
it gets strdupped and strdupping a rubbish random string causes the problem.

Also, if you look at the commit 1fb7d06, it adds the srcline property
into the struct, but not initializing it everywhere needed.

Committer notes:

Now I see, when using --ignore-callees=do_idle we end up here at line
2189 in add_callchain_ip():

2181         if (al.sym != NULL) {
2182                 if (perf_hpp_list.parent && !*parent &&
2183                     symbol__match_regex(al.sym, &parent_regex))
2184                         *parent = al.sym;
2185                 else if (have_ignore_callees && root_al &&
2186                   symbol__match_regex(al.sym, &ignore_callees_regex)) {
2187                         /* Treat this symbol as the root,
2188                            forgetting its callees. */
2189                         *root_al = al;
2190                         callchain_cursor_reset(cursor);
2191                 }
2192         }

And the al that doesn't have the ->srcline field initialized will be
copied to the root_al, so then, back to:

1211 int hist_entry_iter__add(struct hist_entry_iter *iter, struct addr_location *al,
1212                          int max_stack_depth, void *arg)
1213 {
1214         int err, err2;
1215         struct map *alm = NULL;
1216
1217         if (al)
1218                 alm = map__get(al->map);
1219
1220         err = sample__resolve_callchain(iter->sample, &callchain_cursor, &iter->parent,
1221                                         iter->evsel, al, max_stack_depth);
1222         if (err) {
1223                 map__put(alm);
1224                 return err;
1225         }
1226
1227         err = iter->ops->prepare_entry(iter, al);
1228         if (err)
1229                 goto out;
1230
1231         err = iter->ops->add_single_entry(iter, al);
1232         if (err)
1233                 goto out;
1234

That al at line 1221 is what hist_entry_iter__add() (called from
sample__resolve_callchain()) saw as 'root_al', and then:

        iter->ops->add_single_entry(iter, al);

will go on with al->srcline with a bogus value, I'll add the above
sequence to the cset and apply, thanks!

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
CC: Milian Wolff <milian.wolff@kdab.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Fixes: 1fb7d06 ("perf report Use srcline from callchain for hist entries")
Link: https //lore.kernel.org/r/20210719145332.29747-1-mpetlan@redhat.com
Reported-by: Juri Lelli <jlelli@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
nguyenanhgiau pushed a commit to nguyenanhgiau/linux that referenced this issue Jan 5, 2022
commit 19ea40d upstream.

[BUG]
There is a bug report that injected ENOMEM error could leave a tree
block locked while we return to user-space:

  BTRFS info (device loop0): enabling ssd optimizations
  FAULT_INJECTION: forcing a failure.
  name failslab, interval 1, probability 0, space 0, times 0
  CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 raspberrypi#16
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
  rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
  Call Trace:
   __dump_stack lib/dump_stack.c:88 [inline]
   dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106
   fail_dump lib/fault-inject.c:52 [inline]
   should_fail+0x13c/0x160 lib/fault-inject.c:146
   should_failslab+0x5/0x10 mm/slab_common.c:1328
   slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494
   slab_alloc_node mm/slub.c:3120 [inline]
   slab_alloc mm/slub.c:3214 [inline]
   kmem_cache_alloc+0x44/0x280 mm/slub.c:3219
   btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline]
   btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833
   __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415
   btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570
   btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768
   btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905
   btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530
   btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783
   lookup_open+0x660/0x780 fs/namei.c:3282
   open_last_lookups fs/namei.c:3352 [inline]
   path_openat+0x465/0xe20 fs/namei.c:3557
   do_filp_open+0xe3/0x170 fs/namei.c:3588
   do_sys_openat2+0x357/0x4a0 fs/open.c:1200
   do_sys_open+0x87/0xd0 fs/open.c:1216
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
  RIP: 0033:0x46ae99
  Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48
  89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d
  01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
  RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
  RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99
  RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800
  RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017
  R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0

  ================================================
  WARNING: lock held when returning to user space!
  5.15.0-rc1 raspberrypi#16 Not tainted
  ------------------------------------------------
  syz-executor/7579 is leaving the kernel with locks still held!
  1 lock held by syz-executor/7579:
   #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at:
  __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112

[CAUSE]
In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new
extent buffer @buf is locked, but if later operations like adding
delayed tree ref fail, we just free @buf without unlocking it,
resulting above warning.

[FIX]
Unlock @buf in out_free_buf: label.

Reported-by: Hao Sun <sunhao.th@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jan 28, 2022
[ Upstream commit e4a41c2 ]

The following error is reported when running "./test_progs -t for_each"
under arm64:

  bpf_jit: multi-func JIT bug 58 != 56
  [...]
  JIT doesn't support bpf-to-bpf calls

The root cause is the size of BPF_PSEUDO_FUNC instruction increases
from 2 to 3 after the address of called bpf-function is settled and
there are two bpf-to-bpf calls in test_pkt_access. The generated
instructions are shown below:

  0x48:  21 00 C0 D2    movz x1, #0x1, lsl #32
  0x4c:  21 00 80 F2    movk x1, #0x1

  0x48:  E1 3F C0 92    movn x1, #0x1ff, lsl #32
  0x4c:  41 FE A2 F2    movk x1, #0x17f2, lsl #16
  0x50:  81 70 9F F2    movk x1, #0xfb84

Fixing it by using emit_addr_mov_i64() for BPF_PSEUDO_FUNC, so
the size of jited image will not change.

Fixes: 69c087b ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211231151018.3781550-1-houtao1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Jan 31, 2022
arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, #108    ; 0x6c
e1a04001	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, #144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, #108]  ; 0x6c
e2859014	add	r9, r5, #20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, #60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, #16, #4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, #15
e7e36655	ubfx	r6, r5, #12, #4
e205a00f	and	sl, r5, #15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl #2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl #2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, #16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, #16]
e12fff30	blx	r0
e356000f	cm	r6, #15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, #64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
popcornmix pushed a commit that referenced this issue Feb 7, 2022
commit 8b59b0a upstream.

arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, #108    ; 0x6c
e1a04001	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, #144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, #108]  ; 0x6c
e2859014	add	r9, r5, #20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, #60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, #16, #4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, #15
e7e36655	ubfx	r6, r5, #12, #4
e205a00f	and	sl, r5, #15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl #2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl #2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, #16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, #16]
e12fff30	blx	r0
e356000f	cm	r6, #15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, #64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Feb 7, 2022
commit 8b59b0a upstream.

arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, #108    ; 0x6c
e1a04001	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, #144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, #108]  ; 0x6c
e2859014	add	r9, r5, #20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, #60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, #16, #4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, #15
e7e36655	ubfx	r6, r5, #12, #4
e205a00f	and	sl, r5, #15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl #2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl #2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, #16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, #16]
e12fff30	blx	r0
e356000f	cm	r6, #15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, #64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
whdgmawkd pushed a commit to whdgmawkd/linux that referenced this issue Feb 11, 2022
commit 8b59b0a upstream.

arm32 uses software to simulate the instruction replaced
by kprobe. some instructions may be simulated by constructing
assembly functions. therefore, before executing instruction
simulation, it is necessary to construct assembly function
execution environment in C language through binding registers.
after kasan is enabled, the register binding relationship will
be destroyed, resulting in instruction simulation errors and
causing kernel panic.

the kprobe emulate instruction function is distributed in three
files: actions-common.c actions-arm.c actions-thumb.c, so disable
KASAN when compiling these files.

for example, use kprobe insert on cap_capable+20 after kasan
enabled, the cap_capable assembly code is as follows:
<cap_capable>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e1a05000	mov	r5, r0
e280006c	add	r0, r0, raspberrypi#108    ; 0x6c
e1a04001	mov	r4, r1
e1a06002	mov	r6, r2
e59fa090	ldr	sl, [pc, raspberrypi#144]  ;
ebfc7bf8	bl	c03aa4b4 <__asan_load4>
e595706c	ldr	r7, [r5, raspberrypi#108]  ; 0x6c
e2859014	add	r9, r5, raspberrypi#20
......
The emulate_ldr assembly code after enabling kasan is as follows:
c06f1384 <emulate_ldr>:
e92d47f0	push	{r4, r5, r6, r7, r8, r9, sl, lr}
e282803c	add	r8, r2, raspberrypi#60     ; 0x3c
e1a05000	mov	r5, r0
e7e37855	ubfx	r7, r5, raspberrypi#16, raspberrypi#4
e1a00008	mov	r0, r8
e1a09001	mov	r9, r1
e1a04002	mov	r4, r2
ebf35462	bl	c03c6530 <__asan_load4>
e357000f	cmp	r7, raspberrypi#15
e7e36655	ubfx	r6, r5, raspberrypi#12, raspberrypi#4
e205a00f	and	sl, r5, raspberrypi#15
0a000001	beq	c06f13bc <emulate_ldr+0x38>
e0840107	add	r0, r4, r7, lsl raspberrypi#2
ebf3545c	bl	c03c6530 <__asan_load4>
e084010a	add	r0, r4, sl, lsl raspberrypi#2
ebf3545a	bl	c03c6530 <__asan_load4>
e2890010	add	r0, r9, raspberrypi#16
ebf35458	bl	c03c6530 <__asan_load4>
e5990010	ldr	r0, [r9, raspberrypi#16]
e12fff30	blx	r0
e356000f	cm	r6, raspberrypi#15
1a000014	bne	c06f1430 <emulate_ldr+0xac>
e1a06000	mov	r6, r0
e2840040	add	r0, r4, raspberrypi#64     ; 0x40
......

when running in emulate_ldr to simulate the ldr instruction, panic
occurred, and the log is as follows:
Unable to handle kernel NULL pointer dereference at virtual address
00000090
pgd = ecb46400
[00000090] *pgd=2e0fa003, *pmd=00000000
Internal error: Oops: 206 [raspberrypi#1] SMP ARM
PC is at cap_capable+0x14/0xb0
LR is at emulate_ldr+0x50/0xc0
psr: 600d0293 sp : ecd63af8  ip : 00000004  fp : c0a7c30c
r10: 00000000  r9 : c30897f4  r8 : ecd63cd4
r7 : 0000000f  r6 : 0000000a  r5 : e59fa090  r4 : ecd63c98
r3 : c06ae294  r2 : 00000000  r1 : b7611300  r0 : bf4ec008
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 32c5387d  Table: 2d546400  DAC: 55555555
Process bash (pid: 1643, stack limit = 0xecd60190)
(cap_capable) from (kprobe_handler+0x218/0x340)
(kprobe_handler) from (kprobe_trap_handler+0x24/0x48)
(kprobe_trap_handler) from (do_undefinstr+0x13c/0x364)
(do_undefinstr) from (__und_svc_finish+0x0/0x30)
(__und_svc_finish) from (cap_capable+0x18/0xb0)
(cap_capable) from (cap_vm_enough_memory+0x38/0x48)
(cap_vm_enough_memory) from
(security_vm_enough_memory_mm+0x48/0x6c)
(security_vm_enough_memory_mm) from
(copy_process.constprop.5+0x16b4/0x25c8)
(copy_process.constprop.5) from (_do_fork+0xe8/0x55c)
(_do_fork) from (SyS_clone+0x1c/0x24)
(SyS_clone) from (__sys_trace_return+0x0/0x10)
Code: 0050a0e1 6c0080e2 0140a0e1 0260a0e1 (f801f0e7)

Fixes: 35aa1df ("ARM kprobes: instruction single-stepping support")
Fixes: 4210157 ("ARM: 9017/2: Enable KASan for ARM")
Signed-off-by: huangshaobo <huangshaobo6@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Feb 28, 2022
When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
popcornmix pushed a commit that referenced this issue Feb 28, 2022
In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.

This bug may cause task blocked forever in Guest:
 INFO: task stress:7532 blocked for more than 1254 seconds.
       Not tainted 5.10.0 #16
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:stress          state:D stack:    0 pid: 7532 ppid:  1409
 flags:0x00000080
 Call Trace:
  __schedule+0x1e7/0x650
  schedule+0x46/0xb0
  kvm_async_pf_task_wait_schedule+0xad/0xe0
  ? exit_to_user_mode_prepare+0x60/0x70
  __kvm_handle_async_pf+0x4f/0xb0
  ? asm_exc_page_fault+0x8/0x30
  exc_page_fault+0x6f/0x110
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x402d00
 RSP: 002b:00007ffd31912500 EFLAGS: 00010206
 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000

Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
popcornmix pushed a commit that referenced this issue Mar 4, 2022
commit 6f3c1fc upstream.

In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.

This bug may cause task blocked forever in Guest:
 INFO: task stress:7532 blocked for more than 1254 seconds.
       Not tainted 5.10.0 #16
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:stress          state:D stack:    0 pid: 7532 ppid:  1409
 flags:0x00000080
 Call Trace:
  __schedule+0x1e7/0x650
  schedule+0x46/0xb0
  kvm_async_pf_task_wait_schedule+0xad/0xe0
  ? exit_to_user_mode_prepare+0x60/0x70
  __kvm_handle_async_pf+0x4f/0xb0
  ? asm_exc_page_fault+0x8/0x30
  exc_page_fault+0x6f/0x110
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x402d00
 RSP: 002b:00007ffd31912500 EFLAGS: 00010206
 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000

Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 4, 2022
commit 6f3c1fc upstream.

In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.

This bug may cause task blocked forever in Guest:
 INFO: task stress:7532 blocked for more than 1254 seconds.
       Not tainted 5.10.0 #16
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:stress          state:D stack:    0 pid: 7532 ppid:  1409
 flags:0x00000080
 Call Trace:
  __schedule+0x1e7/0x650
  schedule+0x46/0xb0
  kvm_async_pf_task_wait_schedule+0xad/0xe0
  ? exit_to_user_mode_prepare+0x60/0x70
  __kvm_handle_async_pf+0x4f/0xb0
  ? asm_exc_page_fault+0x8/0x30
  exc_page_fault+0x6f/0x110
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x402d00
 RSP: 002b:00007ffd31912500 EFLAGS: 00010206
 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000

Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
ajs124 pushed a commit to helsinki-systems/linux that referenced this issue Mar 7, 2022
commit 6f3c1fc upstream.

In current async pagefault logic, when a page is ready, KVM relies on
kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
a READY event to the Guest. This function test token value of struct
kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a
READY event is finished by Guest. If value is zero meaning that a READY
event is done, so the KVM can deliver another.
But the kvm_arch_setup_async_pf() may produce a valid token with zero
value, which is confused with previous mention and may lead the loss of
this READY event.

This bug may cause task blocked forever in Guest:
 INFO: task stress:7532 blocked for more than 1254 seconds.
       Not tainted 5.10.0 raspberrypi#16
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 task:stress          state:D stack:    0 pid: 7532 ppid:  1409
 flags:0x00000080
 Call Trace:
  __schedule+0x1e7/0x650
  schedule+0x46/0xb0
  kvm_async_pf_task_wait_schedule+0xad/0xe0
  ? exit_to_user_mode_prepare+0x60/0x70
  __kvm_handle_async_pf+0x4f/0xb0
  ? asm_exc_page_fault+0x8/0x30
  exc_page_fault+0x6f/0x110
  ? asm_exc_page_fault+0x8/0x30
  asm_exc_page_fault+0x1e/0x30
 RIP: 0033:0x402d00
 RSP: 002b:00007ffd31912500 EFLAGS: 00010206
 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000

Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Mar 21, 2022
[ Upstream commit 4224cfd ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Mar 21, 2022
[ Upstream commit 4224cfd ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
herrnst pushed a commit to herrnst/linux-raspberrypi that referenced this issue Apr 8, 2022
[ Upstream commit 614c0b9 ]

We may call arm64_apply_bp_hardening() early during entry (e.g. in
el0_ia()) before it is safe to run instrumented code. Unfortunately this
may result in running instrumented code in two cases:

* The hardening callbacks called by arm64_apply_bp_hardening() are not
  marked as `noinstr`, and have been observed to be instrumented when
  compiled with either GCC or LLVM.

* Since arm64_apply_bp_hardening() itself is only marked as `inline`
  rather than `__always_inline`, it is possible that the compiler
  decides to place it out-of-line, whereupon it may be instrumented.

For example, with defconfig built with clang 13.0.0,
call_hvc_arch_workaround_1() is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503233f        paciasp
|        f81f0ffe        str     x30, [sp, #-16]!
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        f84107fe        ldr     x30, [sp], raspberrypi#16
|        d50323bf        autiasp
|        d65f03c0        ret

... but when CONFIG_FTRACE=y and CONFIG_KCOV=y this is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503245f        bti     c
|        d503201f        nop
|        d503201f        nop
|        d503233f        paciasp
|        a9bf7bfd        stp     x29, x30, [sp, #-16]!
|        910003fd        mov     x29, sp
|        94000000        bl      0 <__sanitizer_cov_trace_pc>
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        a8c17bfd        ldp     x29, x30, [sp], raspberrypi#16
|        d50323bf        autiasp
|        d65f03c0        ret

... with a patchable function entry registered with ftrace, and a direct
call to __sanitizer_cov_trace_pc(). Neither of these are safe early
during entry sequences.

This patch avoids the unsafe instrumentation by marking
arm64_apply_bp_hardening() as `__always_inline` and by marking the
hardening functions as `noinstr`. This avoids the potential for
instrumentation, and causes clang to consistently generate the function
as with the defconfig sample.

Note: in the defconfig compilation, when CONFIG_SVE=y, x30 is spilled to
the stack without being placed in a frame record, which will result in a
missing entry if call_hvc_arch_workaround_1() is backtraced. Similar is
true of qcom_link_stack_sanitisation(), where inline asm spills the LR
to a GPR prior to corrupting it. This is not a significant issue
presently as we will only backtrace here if an exception is taken, and
in such cases we may omit entries for other reasons today.

The relevant hardening functions were introduced in commits:

  ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
  b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")

... and these were subsequently moved in commit:

  d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")

The arm64_apply_bp_hardening() function was introduced in commit:

  0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")

... and was subsequently moved and reworked in commit:

  6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")

Fixes: ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
Fixes: b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")
Fixes: d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")
Fixes: 0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")
Fixes: 6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220224181028.512873-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
herrnst pushed a commit to herrnst/linux-raspberrypi that referenced this issue Apr 8, 2022
[ Upstream commit fe2640b ]

In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.

If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.

This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:

  PID: 7574   TASK: c0000000d492cb80  CPU: 13  COMMAND: "drmgr"
   #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
   raspberrypi#1 [c0000000e4f075d0] oops_end at c000000000029608
   raspberrypi#2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
   raspberrypi#3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
   raspberrypi#4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
   Data SLB Access [380] exception frame:
   R0:  c000000000167250    R1:  c0000000e4f07a00    R2:  c000000002a46100
   R3:  c000000002b39ce8    R4:  00000000000000c0    R5:  00000000000000a9
   R6:  3894674d000000c0    R7:  0000000000000000    R8:  00000000000000ff
   R9:  0000000000000100    R10: 6b6b6b6b6b6b6b6b    R11: 0000000000008000
   R12: c00000000023da80    R13: c0000009ffd38b00    R14: 0000000000000000
   R15: 000000011c87f0f0    R16: 0000000000000006    R17: 0000000000000003
   R18: 0000000000000002    R19: 0000000000000004    R20: 0000000000000005
   R21: 000000011c87ede8    R22: 000000011c87c5a8    R23: 000000011c87d3a0
   R24: 0000000000000000    R25: 0000000000000001    R26: c0000000e4f07cc8
   R27: c00000004d1cc400    R28: c0080000031d00e8    R29: c00000004d23d800
   R30: c00000004d1d2400    R31: c00000004d1d2540
   NIP: c000000000167258    MSR: 8000000000009033    OR3: c000000000e9f474
   CTR: 0000000000000000    LR:  c000000000167250    XER: 0000000020040003
   CCR: 0000000024088420    MQ:  0000000000000000    DAR: 6b6b6b6b6b6b6ba3
   DSISR: c0000000e4f07920     Syscall Result: fffffffffffffff2
   [NIP  : release_resource+56]
   [LR   : release_resource+48]
   raspberrypi#5 [c0000000e4f07a00] release_resource at c000000000167258  (unreliable)
   raspberrypi#6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
   raspberrypi#7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
   raspberrypi#8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
   raspberrypi#9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
  raspberrypi#10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
  raspberrypi#11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
  raspberrypi#12 [c0000000e4f07c70] new_sync_write at c00000000054339c
  raspberrypi#13 [c0000000e4f07d10] vfs_write at c000000000546624
  raspberrypi#14 [c0000000e4f07d60] ksys_write at c0000000005469f4
  raspberrypi#15 [c0000000e4f07db0] system_call_exception at c000000000030840
  raspberrypi#16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168

To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.

Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <zdai@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lore.kernel.org/r/20220318034219.1188008-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 8, 2022
[ Upstream commit 4224cfd ]

When bringing down the netdevice or system shutdown, a panic can be
triggered while accessing the sysfs path because the device is already
removed.

    [  755.549084] mlx5_core 0000:12:00.1: Shutdown was called
    [  756.404455] mlx5_core 0000:12:00.0: Shutdown was called
    ...
    [  757.937260] BUG: unable to handle kernel NULL pointer dereference at           (null)
    [  758.031397] IP: [<ffffffff8ee11acb>] dma_pool_alloc+0x1ab/0x280

    crash> bt
    ...
    PID: 12649  TASK: ffff8924108f2100  CPU: 1   COMMAND: "amsd"
    ...
     #9 [ffff89240e1a38b0] page_fault at ffffffff8f38c778
        [exception RIP: dma_pool_alloc+0x1ab]
        RIP: ffffffff8ee11acb  RSP: ffff89240e1a3968  RFLAGS: 00010046
        RAX: 0000000000000246  RBX: ffff89243d874100  RCX: 0000000000001000
        RDX: 0000000000000000  RSI: 0000000000000246  RDI: ffff89243d874090
        RBP: ffff89240e1a39c0   R8: 000000000001f080   R9: ffff8905ffc03c00
        R10: ffffffffc04680d4  R11: ffffffff8edde9fd  R12: 00000000000080d0
        R13: ffff89243d874090  R14: ffff89243d874080  R15: 0000000000000000
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    #10 [ffff89240e1a39c8] mlx5_alloc_cmd_msg at ffffffffc04680f3 [mlx5_core]
    #11 [ffff89240e1a3a18] cmd_exec at ffffffffc046ad62 [mlx5_core]
    #12 [ffff89240e1a3ab8] mlx5_cmd_exec at ffffffffc046b4fb [mlx5_core]
    #13 [ffff89240e1a3ae8] mlx5_core_access_reg at ffffffffc0475434 [mlx5_core]
    #14 [ffff89240e1a3b40] mlx5e_get_fec_caps at ffffffffc04a7348 [mlx5_core]
    #15 [ffff89240e1a3bb0] get_fec_supported_advertised at ffffffffc04992bf [mlx5_core]
    #16 [ffff89240e1a3c08] mlx5e_get_link_ksettings at ffffffffc049ab36 [mlx5_core]
    #17 [ffff89240e1a3ce8] __ethtool_get_link_ksettings at ffffffff8f25db46
    #18 [ffff89240e1a3d48] speed_show at ffffffff8f277208
    #19 [ffff89240e1a3dd8] dev_attr_show at ffffffff8f0b70e3
    #20 [ffff89240e1a3df8] sysfs_kf_seq_show at ffffffff8eedbedf
    #21 [ffff89240e1a3e18] kernfs_seq_show at ffffffff8eeda596
    #22 [ffff89240e1a3e28] seq_read at ffffffff8ee76d10
    #23 [ffff89240e1a3e98] kernfs_fop_read at ffffffff8eedaef5
    #24 [ffff89240e1a3ed8] vfs_read at ffffffff8ee4e3ff
    #25 [ffff89240e1a3f08] sys_read at ffffffff8ee4f27f
    #26 [ffff89240e1a3f50] system_call_fastpath at ffffffff8f395f92

    crash> net_device.state ffff89443b0c0000
      state = 0x5  (__LINK_STATE_START| __LINK_STATE_NOCARRIER)

To prevent this scenario, we also make sure that the netdevice is present.

Signed-off-by: suresh kumar <suresh2514@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 11, 2022
[ Upstream commit 614c0b9 ]

We may call arm64_apply_bp_hardening() early during entry (e.g. in
el0_ia()) before it is safe to run instrumented code. Unfortunately this
may result in running instrumented code in two cases:

* The hardening callbacks called by arm64_apply_bp_hardening() are not
  marked as `noinstr`, and have been observed to be instrumented when
  compiled with either GCC or LLVM.

* Since arm64_apply_bp_hardening() itself is only marked as `inline`
  rather than `__always_inline`, it is possible that the compiler
  decides to place it out-of-line, whereupon it may be instrumented.

For example, with defconfig built with clang 13.0.0,
call_hvc_arch_workaround_1() is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503233f        paciasp
|        f81f0ffe        str     x30, [sp, #-16]!
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        f84107fe        ldr     x30, [sp], #16
|        d50323bf        autiasp
|        d65f03c0        ret

... but when CONFIG_FTRACE=y and CONFIG_KCOV=y this is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503245f        bti     c
|        d503201f        nop
|        d503201f        nop
|        d503233f        paciasp
|        a9bf7bfd        stp     x29, x30, [sp, #-16]!
|        910003fd        mov     x29, sp
|        94000000        bl      0 <__sanitizer_cov_trace_pc>
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        a8c17bfd        ldp     x29, x30, [sp], #16
|        d50323bf        autiasp
|        d65f03c0        ret

... with a patchable function entry registered with ftrace, and a direct
call to __sanitizer_cov_trace_pc(). Neither of these are safe early
during entry sequences.

This patch avoids the unsafe instrumentation by marking
arm64_apply_bp_hardening() as `__always_inline` and by marking the
hardening functions as `noinstr`. This avoids the potential for
instrumentation, and causes clang to consistently generate the function
as with the defconfig sample.

Note: in the defconfig compilation, when CONFIG_SVE=y, x30 is spilled to
the stack without being placed in a frame record, which will result in a
missing entry if call_hvc_arch_workaround_1() is backtraced. Similar is
true of qcom_link_stack_sanitisation(), where inline asm spills the LR
to a GPR prior to corrupting it. This is not a significant issue
presently as we will only backtrace here if an exception is taken, and
in such cases we may omit entries for other reasons today.

The relevant hardening functions were introduced in commits:

  ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
  b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")

... and these were subsequently moved in commit:

  d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")

The arm64_apply_bp_hardening() function was introduced in commit:

  0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")

... and was subsequently moved and reworked in commit:

  6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")

Fixes: ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
Fixes: b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")
Fixes: d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")
Fixes: 0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")
Fixes: 6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220224181028.512873-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 11, 2022
[ Upstream commit fe2640b ]

In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.

If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.

This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:

  PID: 7574   TASK: c0000000d492cb80  CPU: 13  COMMAND: "drmgr"
   #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
   #1 [c0000000e4f075d0] oops_end at c000000000029608
   #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
   #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
   #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
   Data SLB Access [380] exception frame:
   R0:  c000000000167250    R1:  c0000000e4f07a00    R2:  c000000002a46100
   R3:  c000000002b39ce8    R4:  00000000000000c0    R5:  00000000000000a9
   R6:  3894674d000000c0    R7:  0000000000000000    R8:  00000000000000ff
   R9:  0000000000000100    R10: 6b6b6b6b6b6b6b6b    R11: 0000000000008000
   R12: c00000000023da80    R13: c0000009ffd38b00    R14: 0000000000000000
   R15: 000000011c87f0f0    R16: 0000000000000006    R17: 0000000000000003
   R18: 0000000000000002    R19: 0000000000000004    R20: 0000000000000005
   R21: 000000011c87ede8    R22: 000000011c87c5a8    R23: 000000011c87d3a0
   R24: 0000000000000000    R25: 0000000000000001    R26: c0000000e4f07cc8
   R27: c00000004d1cc400    R28: c0080000031d00e8    R29: c00000004d23d800
   R30: c00000004d1d2400    R31: c00000004d1d2540
   NIP: c000000000167258    MSR: 8000000000009033    OR3: c000000000e9f474
   CTR: 0000000000000000    LR:  c000000000167250    XER: 0000000020040003
   CCR: 0000000024088420    MQ:  0000000000000000    DAR: 6b6b6b6b6b6b6ba3
   DSISR: c0000000e4f07920     Syscall Result: fffffffffffffff2
   [NIP  : release_resource+56]
   [LR   : release_resource+48]
   #5 [c0000000e4f07a00] release_resource at c000000000167258  (unreliable)
   #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
   #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
   #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
   #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
  #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
  #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
  #12 [c0000000e4f07c70] new_sync_write at c00000000054339c
  #13 [c0000000e4f07d10] vfs_write at c000000000546624
  #14 [c0000000e4f07d60] ksys_write at c0000000005469f4
  #15 [c0000000e4f07db0] system_call_exception at c000000000030840
  #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168

To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.

Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <zdai@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lore.kernel.org/r/20220318034219.1188008-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 11, 2022
[ Upstream commit 614c0b9 ]

We may call arm64_apply_bp_hardening() early during entry (e.g. in
el0_ia()) before it is safe to run instrumented code. Unfortunately this
may result in running instrumented code in two cases:

* The hardening callbacks called by arm64_apply_bp_hardening() are not
  marked as `noinstr`, and have been observed to be instrumented when
  compiled with either GCC or LLVM.

* Since arm64_apply_bp_hardening() itself is only marked as `inline`
  rather than `__always_inline`, it is possible that the compiler
  decides to place it out-of-line, whereupon it may be instrumented.

For example, with defconfig built with clang 13.0.0,
call_hvc_arch_workaround_1() is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503233f        paciasp
|        f81f0ffe        str     x30, [sp, #-16]!
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        f84107fe        ldr     x30, [sp], #16
|        d50323bf        autiasp
|        d65f03c0        ret

... but when CONFIG_FTRACE=y and CONFIG_KCOV=y this is compiled as:

| <call_hvc_arch_workaround_1>:
|        d503245f        bti     c
|        d503201f        nop
|        d503201f        nop
|        d503233f        paciasp
|        a9bf7bfd        stp     x29, x30, [sp, #-16]!
|        910003fd        mov     x29, sp
|        94000000        bl      0 <__sanitizer_cov_trace_pc>
|        320183e0        mov     w0, #0x80008000
|        d503201f        nop
|        d4000002        hvc     #0x0
|        a8c17bfd        ldp     x29, x30, [sp], #16
|        d50323bf        autiasp
|        d65f03c0        ret

... with a patchable function entry registered with ftrace, and a direct
call to __sanitizer_cov_trace_pc(). Neither of these are safe early
during entry sequences.

This patch avoids the unsafe instrumentation by marking
arm64_apply_bp_hardening() as `__always_inline` and by marking the
hardening functions as `noinstr`. This avoids the potential for
instrumentation, and causes clang to consistently generate the function
as with the defconfig sample.

Note: in the defconfig compilation, when CONFIG_SVE=y, x30 is spilled to
the stack without being placed in a frame record, which will result in a
missing entry if call_hvc_arch_workaround_1() is backtraced. Similar is
true of qcom_link_stack_sanitisation(), where inline asm spills the LR
to a GPR prior to corrupting it. This is not a significant issue
presently as we will only backtrace here if an exception is taken, and
in such cases we may omit entries for other reasons today.

The relevant hardening functions were introduced in commits:

  ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
  b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")

... and these were subsequently moved in commit:

  d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")

The arm64_apply_bp_hardening() function was introduced in commit:

  0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")

... and was subsequently moved and reworked in commit:

  6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")

Fixes: ec82b56 ("arm64: Implement branch predictor hardening for Falkor")
Fixes: b092201 ("arm64: Add ARM_SMCCC_ARCH_WORKAROUND_1 BP hardening support")
Fixes: d4647f0 ("arm64: Rewrite Spectre-v2 mitigation code")
Fixes: 0f15adb ("arm64: Add skeleton to harden the branch predictor against aliasing attacks")
Fixes: 6279017 ("KVM: arm64: Move BP hardening helpers into spectre.h")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Brown <broonie@kernel.org>
Cc: Will Deacon <will@kernel.org>
Acked-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220224181028.512873-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue Apr 11, 2022
[ Upstream commit fe2640b ]

In remove_phb_dynamic() we use &phb->io_resource, after we've called
device_unregister(&host_bridge->dev). But the unregister may have freed
phb, because pcibios_free_controller_deferred() is the release function
for the host_bridge.

If there are no outstanding references when we call device_unregister()
then phb will be freed out from under us.

This has gone mainly unnoticed, but with slub_debug and page_poison
enabled it can lead to a crash:

  PID: 7574   TASK: c0000000d492cb80  CPU: 13  COMMAND: "drmgr"
   #0 [c0000000e4f075a0] crash_kexec at c00000000027d7dc
   #1 [c0000000e4f075d0] oops_end at c000000000029608
   #2 [c0000000e4f07650] __bad_page_fault at c0000000000904b4
   #3 [c0000000e4f076c0] do_bad_slb_fault at c00000000009a5a8
   #4 [c0000000e4f076f0] data_access_slb_common_virt at c000000000008b30
   Data SLB Access [380] exception frame:
   R0:  c000000000167250    R1:  c0000000e4f07a00    R2:  c000000002a46100
   R3:  c000000002b39ce8    R4:  00000000000000c0    R5:  00000000000000a9
   R6:  3894674d000000c0    R7:  0000000000000000    R8:  00000000000000ff
   R9:  0000000000000100    R10: 6b6b6b6b6b6b6b6b    R11: 0000000000008000
   R12: c00000000023da80    R13: c0000009ffd38b00    R14: 0000000000000000
   R15: 000000011c87f0f0    R16: 0000000000000006    R17: 0000000000000003
   R18: 0000000000000002    R19: 0000000000000004    R20: 0000000000000005
   R21: 000000011c87ede8    R22: 000000011c87c5a8    R23: 000000011c87d3a0
   R24: 0000000000000000    R25: 0000000000000001    R26: c0000000e4f07cc8
   R27: c00000004d1cc400    R28: c0080000031d00e8    R29: c00000004d23d800
   R30: c00000004d1d2400    R31: c00000004d1d2540
   NIP: c000000000167258    MSR: 8000000000009033    OR3: c000000000e9f474
   CTR: 0000000000000000    LR:  c000000000167250    XER: 0000000020040003
   CCR: 0000000024088420    MQ:  0000000000000000    DAR: 6b6b6b6b6b6b6ba3
   DSISR: c0000000e4f07920     Syscall Result: fffffffffffffff2
   [NIP  : release_resource+56]
   [LR   : release_resource+48]
   #5 [c0000000e4f07a00] release_resource at c000000000167258  (unreliable)
   #6 [c0000000e4f07a30] remove_phb_dynamic at c000000000105648
   #7 [c0000000e4f07ab0] dlpar_remove_slot at c0080000031a09e8 [rpadlpar_io]
   #8 [c0000000e4f07b50] remove_slot_store at c0080000031a0b9c [rpadlpar_io]
   #9 [c0000000e4f07be0] kobj_attr_store at c000000000817d8c
  #10 [c0000000e4f07c00] sysfs_kf_write at c00000000063e504
  #11 [c0000000e4f07c20] kernfs_fop_write_iter at c00000000063d868
  #12 [c0000000e4f07c70] new_sync_write at c00000000054339c
  #13 [c0000000e4f07d10] vfs_write at c000000000546624
  #14 [c0000000e4f07d60] ksys_write at c0000000005469f4
  #15 [c0000000e4f07db0] system_call_exception at c000000000030840
  #16 [c0000000e4f07e10] system_call_vectored_common at c00000000000c168

To avoid it, we can take a reference to the host_bridge->dev until we're
done using phb. Then when we drop the reference the phb will be freed.

Fixes: 2dd9c11 ("powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)")
Reported-by: David Dai <zdai@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Tested-by: Sachin Sant <sachinp@linux.ibm.com>
Link: https://lore.kernel.org/r/20220318034219.1188008-1-mpe@ellerman.id.au
Signed-off-by: Sasha Levin <sashal@kernel.org>
popcornmix pushed a commit that referenced this issue May 4, 2022
kernfs_remove supported NULL kernfs_node param to bail out but revent
per-fs lock change introduced regression that dereferencing the
param without NULL check so kernel goes crash.

This patch checks the NULL kernfs_node in kernfs_remove and if so,
just return.

Quote from bug report by Jirka

```
The bug is triggered by running NAS Parallel benchmark suite on
SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error
log:

[  247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  247.036009] #PF: supervisor read access in kernel mode
[  247.036009] #PF: error_code(0x0000) - not-present page
[  247.036009] PGD 0 P4D 0
[  247.036009] Oops: 0000 [#1] PREEMPT SMP PTI
[  247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted
5.16.0393c3714081a53795bbff0e985d24146def6f57f+ #16
[  247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS
2.0b 03/07/2018
[  247.058060] RIP: 0010:kernfs_remove+0x8/0x50
[  247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4
ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83
c4 60
[  247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246
[  247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018
[  247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000
[  247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018
[  247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000
[  247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100
[  247.122048] FS:  00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000)
knlGS:0000000000000000
[  247.122048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0
[  247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  247.122048] PKRU: 55555554
[  247.122048] Call Trace:
[  247.122048]  <TASK>
[  247.122048]  rdt_kill_sb+0x29d/0x350
[  247.122048]  deactivate_locked_super+0x36/0xa0
[  247.122048]  cleanup_mnt+0x131/0x190
[  247.122048]  task_work_run+0x5c/0x90
[  247.122048]  exit_to_user_mode_prepare+0x229/0x230
[  247.122048]  syscall_exit_to_user_mode+0x18/0x40
[  247.122048]  do_syscall_64+0x48/0x90
[  247.122048]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  247.122048] RIP: 0033:0x7f01be2d735b
```

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696
Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/
Fixes: 393c371 (kernfs: switch global kernfs_rwsem lock to per-fs lock)
Cc: stable@vger.kernel.org
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
xukuohai pushed a commit to xukuohai/linux-raspberry-pi that referenced this issue May 9, 2022
The BPF STX/LDX instruction uses offset relative to the FP to address
stack space. Since the BPF_FP locates at the top of the frame, the offset
is usually a negative number. However, arm64 str/ldr immediate instruction
requires that offset be a positive number.  Therefore, this patch tries to
convert the offsets.

The method is to find the negative offset furthest from the FP firstly.
Then add it to the FP, calculate a bottom position, called FPB, and then
adjust the offsets in other STR/LDX instructions relative to FPB.

FPB is saved using the callee-saved register x27 of arm64 which is not
used yet.

Before adjusting the offset, the patch checks every instruction to ensure
that the FP does not change in run-time. If the FP may change, no offset
is adjusted.

For example, for the following bpftrace command:

  bpftrace -e 'kprobe:do_sys_open { printf("opening: %s\n", str(arg1)); }'

Without this patch, jited code(fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   mov     x25, sp
  1c:   mov     x26, #0x0                       // #0
  20:   bti     j
  24:   sub     sp, sp, #0x90
  28:   add     x19, x0, #0x0
  2c:   mov     x0, #0x0                        // #0
  30:   mov     x10, #0xffffffffffffff78        // #-136
  34:   str     x0, [x25, x10]
  38:   mov     x10, #0xffffffffffffff80        // #-128
  3c:   str     x0, [x25, x10]
  40:   mov     x10, #0xffffffffffffff88        // #-120
  44:   str     x0, [x25, x10]
  48:   mov     x10, #0xffffffffffffff90        // #-112
  4c:   str     x0, [x25, x10]
  50:   mov     x10, #0xffffffffffffff98        // #-104
  54:   str     x0, [x25, x10]
  58:   mov     x10, #0xffffffffffffffa0        // #-96
  5c:   str     x0, [x25, x10]
  60:   mov     x10, #0xffffffffffffffa8        // #-88
  64:   str     x0, [x25, x10]
  68:   mov     x10, #0xffffffffffffffb0        // #-80
  6c:   str     x0, [x25, x10]
  70:   mov     x10, #0xffffffffffffffb8        // #-72
  74:   str     x0, [x25, x10]
  78:   mov     x10, #0xffffffffffffffc0        // #-64
  7c:   str     x0, [x25, x10]
  80:   mov     x10, #0xffffffffffffffc8        // #-56
  84:   str     x0, [x25, x10]
  88:   mov     x10, #0xffffffffffffffd0        // #-48
  8c:   str     x0, [x25, x10]
  90:   mov     x10, #0xffffffffffffffd8        // #-40
  94:   str     x0, [x25, x10]
  98:   mov     x10, #0xffffffffffffffe0        // #-32
  9c:   str     x0, [x25, x10]
  a0:   mov     x10, #0xffffffffffffffe8        // #-24
  a4:   str     x0, [x25, x10]
  a8:   mov     x10, #0xfffffffffffffff0        // #-16
  ac:   str     x0, [x25, x10]
  b0:   mov     x10, #0xfffffffffffffff8        // #-8
  b4:   str     x0, [x25, x10]
  b8:   mov     x10, #0x8                       // raspberrypi#8
  bc:   ldr     x2, [x19, x10]
  [...]

With this patch, jited code(fragment):

   0:   bti     c
   4:   stp     x29, x30, [sp, #-16]!
   8:   mov     x29, sp
   c:   stp     x19, x20, [sp, #-16]!
  10:   stp     x21, x22, [sp, #-16]!
  14:   stp     x25, x26, [sp, #-16]!
  18:   stp     x27, x28, [sp, #-16]!
  1c:   mov     x25, sp
  20:   sub     x27, x25, #0x88
  24:   mov     x26, #0x0                       // #0
  28:   bti     j
  2c:   sub     sp, sp, #0x90
  30:   add     x19, x0, #0x0
  34:   mov     x0, #0x0                        // #0
  38:   str     x0, [x27]
  3c:   str     x0, [x27, raspberrypi#8]
  40:   str     x0, [x27, raspberrypi#16]
  44:   str     x0, [x27, raspberrypi#24]
  48:   str     x0, [x27, raspberrypi#32]
  4c:   str     x0, [x27, raspberrypi#40]
  50:   str     x0, [x27, raspberrypi#48]
  54:   str     x0, [x27, raspberrypi#56]
  58:   str     x0, [x27, raspberrypi#64]
  5c:   str     x0, [x27, raspberrypi#72]
  60:   str     x0, [x27, raspberrypi#80]
  64:   str     x0, [x27, raspberrypi#88]
  68:   str     x0, [x27, raspberrypi#96]
  6c:   str     x0, [x27, raspberrypi#104]
  70:   str     x0, [x27, raspberrypi#112]
  74:   str     x0, [x27, raspberrypi#120]
  78:   str     x0, [x27, raspberrypi#128]
  7c:   ldr     x2, [x19, raspberrypi#8]
  [...]

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20220321152852.2334294-4-xukuohai@huawei.com
herrnst pushed a commit to herrnst/linux-raspberrypi that referenced this issue May 9, 2022
commit ad8d869 upstream.

kernfs_remove supported NULL kernfs_node param to bail out but revent
per-fs lock change introduced regression that dereferencing the
param without NULL check so kernel goes crash.

This patch checks the NULL kernfs_node in kernfs_remove and if so,
just return.

Quote from bug report by Jirka

```
The bug is triggered by running NAS Parallel benchmark suite on
SuperMicro servers with 2x Xeon(R) Gold 6126 CPU. Here is the error
log:

[  247.035564] BUG: kernel NULL pointer dereference, address: 0000000000000008
[  247.036009] #PF: supervisor read access in kernel mode
[  247.036009] #PF: error_code(0x0000) - not-present page
[  247.036009] PGD 0 P4D 0
[  247.036009] Oops: 0000 [raspberrypi#1] PREEMPT SMP PTI
[  247.058060] CPU: 1 PID: 6546 Comm: umount Not tainted
5.16.0393c3714081a53795bbff0e985d24146def6f57f+ raspberrypi#16
[  247.058060] Hardware name: Supermicro Super Server/X11DDW-L, BIOS
2.0b 03/07/2018
[  247.058060] RIP: 0010:kernfs_remove+0x8/0x50
[  247.058060] Code: 4c 89 e0 5b 5d 41 5c 41 5d 41 5e c3 49 c7 c4 f4
ff ff ff eb b2 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00
41 54 55 <48> 8b 47 08 48 89 fd 48 85 c0 48 0f 44 c7 4c 8b 60 50 49 83
c4 60
[  247.058060] RSP: 0018:ffffbbfa48a27e48 EFLAGS: 00010246
[  247.058060] RAX: 0000000000000001 RBX: ffffffff89e31f98 RCX: 0000000080200018
[  247.058060] RDX: 0000000080200019 RSI: fffff6760786c900 RDI: 0000000000000000
[  247.058060] RBP: ffffffff89e31f98 R08: ffff926b61b24d00 R09: 0000000080200018
[  247.122048] R10: ffff926b61b24d00 R11: ffff926a8040c000 R12: ffff927bd09a2000
[  247.122048] R13: ffffffff89e31fa0 R14: dead000000000122 R15: dead000000000100
[  247.122048] FS:  00007f01be0a8c40(0000) GS:ffff926fa8e40000(0000)
knlGS:0000000000000000
[  247.122048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  247.122048] CR2: 0000000000000008 CR3: 00000001145c6003 CR4: 00000000007706e0
[  247.122048] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  247.122048] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  247.122048] PKRU: 55555554
[  247.122048] Call Trace:
[  247.122048]  <TASK>
[  247.122048]  rdt_kill_sb+0x29d/0x350
[  247.122048]  deactivate_locked_super+0x36/0xa0
[  247.122048]  cleanup_mnt+0x131/0x190
[  247.122048]  task_work_run+0x5c/0x90
[  247.122048]  exit_to_user_mode_prepare+0x229/0x230
[  247.122048]  syscall_exit_to_user_mode+0x18/0x40
[  247.122048]  do_syscall_64+0x48/0x90
[  247.122048]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  247.122048] RIP: 0033:0x7f01be2d735b
```

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215696
Link: https://lore.kernel.org/lkml/CAE4VaGDZr_4wzRn2___eDYRtmdPaGGJdzu_LCSkJYuY9BEO3cw@mail.gmail.com/
Fixes: 393c371 (kernfs: switch global kernfs_rwsem lock to per-fs lock)
Cc: stable@vger.kernel.org
Reported-by: Jirka Hladky <jhladky@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Link: https://lore.kernel.org/r/20220427172152.3505364-1-minchan@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
popcornmix pushed a commit that referenced this issue Jul 18, 2022
[ Upstream commit e4a41c2 ]

The following error is reported when running "./test_progs -t for_each"
under arm64:

  bpf_jit: multi-func JIT bug 58 != 56
  [...]
  JIT doesn't support bpf-to-bpf calls

The root cause is the size of BPF_PSEUDO_FUNC instruction increases
from 2 to 3 after the address of called bpf-function is settled and
there are two bpf-to-bpf calls in test_pkt_access. The generated
instructions are shown below:

  0x48:  21 00 C0 D2    movz x1, #0x1, lsl #32
  0x4c:  21 00 80 F2    movk x1, #0x1

  0x48:  E1 3F C0 92    movn x1, #0x1ff, lsl #32
  0x4c:  41 FE A2 F2    movk x1, #0x17f2, lsl #16
  0x50:  81 70 9F F2    movk x1, #0xfb84

Fixing it by using emit_addr_mov_i64() for BPF_PSEUDO_FUNC, so
the size of jited image will not change.

Fixes: 69c087b ("bpf: Add bpf_for_each_map_elem() helper")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211231151018.3781550-1-houtao1@huawei.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants