PCIe-port not working on RK3399 #116

Open
k-a-z-u opened this issue Aug 10, 2018 · 27 comments

Comments

@k-a-z-u

k-a-z-u commented Aug 10, 2018

We have a RockPro64 board here and tried to get the PCIe port working.
Depending on the card that was inserted, the port was either simply disabled, or the kernel panicked during boot.

Dmesg when no card is in the slot (4.4.138-1094):

[    3.361232] rockchip-pcie f8000000.pcie: GPIO lookup for consumer ep
[    3.361239] rockchip-pcie f8000000.pcie: using device tree for GPIO lookup
[    3.361260] of_get_named_gpiod_flags: parsed 'ep-gpios' property of node '/pcie@f8000000[0]' - status (0)
[    3.361450] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply from device tree
[    3.361533] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply from device tree
[    3.361539] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply property in node /pcie@f8000000 failed
[    3.361556] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[    3.366975] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply from device tree
[    3.366992] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply property in node /pcie@f8000000 failed
[    3.367016] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[    3.371361] rockchip-pcie f8000000.pcie: invalid power supply
[    3.620975] EXT4-fs (mmcblk0p7): mounted filesystem with writeback data mode. Opts: (null)
[    3.875231] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
[    3.890425] rockchip-pcie: probe of f8000000.pcie failed with error -110
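
For anyone scanning their own logs: on Linux, errno 110 is ETIMEDOUT, so the `-110` in the last line is the link training timeout reported just before it. Below is a minimal Python sketch for pulling failed probes out of dmesg text. The function name `find_probe_failures` and the one-entry errno table are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re

# Tiny illustrative subset of Linux errno values (110 is ETIMEDOUT on Linux).
LINUX_ERRNO = {110: "ETIMEDOUT"}

# Matches lines such as:
#   rockchip-pcie: probe of f8000000.pcie failed with error -110
PROBE_FAIL = re.compile(
    r"(?P<driver>[\w-]+): probe of (?P<device>\S+) failed with error -(?P<code>\d+)"
)

def find_probe_failures(dmesg_text):
    """Return (driver, device, errno_name) for every failed probe found."""
    results = []
    for line in dmesg_text.splitlines():
        match = PROBE_FAIL.search(line)
        if match:
            code = int(match.group("code"))
            # Fall back to the raw number for codes not in the small table.
            name = LINUX_ERRNO.get(code, str(code))
            results.append((match.group("driver"), match.group("device"), name))
    return results

sample = (
    "[    3.875231] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!\n"
    "[    3.890425] rockchip-pcie: probe of f8000000.pcie failed with error -110\n"
)
print(find_probe_failures(sample))
```

Feeding it the output of `dmesg` gives a quick yes/no on whether the probe failed with a timeout on a given boot.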

With cards like the Dell PowerEdge PERC 5i SAS RAID controller, the kernel seldom boots; when it does, it ignores the card and gives the dmesg output from above. Most of the time, the kernel crashes with one of a couple of different stack traces.

This is a stack trace with this card using a mainline kernel. In contrast to the 4.4 kernel, mainline continues booting, so we were able to copy it out:

[    7.044117] Hardware name: Pine64 RockPro64 (DT)
[    7.044656] pstate: 60000085 (nZCv daIf -PAN -UAO)
[    7.045223] pc : rockchip_pcie_rd_conf+0x18c/0x1f8 [pcie_rockchip_host]
[    7.045989] lr : rockchip_pcie_rd_conf+0x17c/0x1f8 [pcie_rockchip_host]
[    7.046748] sp : ffff00000e6f3730
[    7.047137] x29: ffff00000e6f3730 x28: 0000000000000001 
[    7.047754] x27: 0000000000000000 x26: 0000000000000000 
[    7.048372] x25: 0000000000000000 x24: ffff00000e6f3854 
[    7.048990] x23: ffff8000f1557800 x22: ffff00000e6f37b4 
[    7.049607] x21: ffff8000f0f5c398 x20: 0000000000000004 
[    7.050225] x19: ffff000010100000 x18: ffffffffffffffff 
[    7.050842] x17: 0000000000000000 x16: 0000000000000000 
[    7.051460] x15: 0000000000000000 x14: 000000000000024e 
[    7.052077] x13: 0000000000000001 x12: 0000000000000000 
[    7.052694] x11: 0000000000000001 x10: 0000000000000960 
[    7.064723] x9 : 0000000000000000 x8 : 0000000000000000 
[    7.076596] x7 : 0000000000000000 x6 : 0000000000000000 
[    7.088340] x5 : 0000000000100000 x4 : 0000000000c00008 
[    7.099979] x3 : ffff000013000000 x2 : 000000000080000a 
[    7.111498] x1 : ffff000013c00008 x0 : ffff000010000000 
[    7.122902] Process systemd-udevd (pid: 2317, stack limit = 0x(____ptrval____))
[    7.134519] Call trace:
[    7.145583]  rockchip_pcie_rd_conf+0x18c/0x1f8 [pcie_rockchip_host]
[    7.157044]  pci_bus_read_config_dword+0x84/0xe0
[    7.168278]  pci_bus_read_dev_vendor_id+0x2c/0x1a0
[    7.179422]  pci_scan_single_device+0x78/0xf8
[    7.190431]  pci_scan_slot+0x34/0xf0
[    7.201243]  pci_scan_child_bus_extend+0x50/0x290
[    7.212087]  pci_scan_bridge_extend+0x2ec/0x4e0
[    7.222814]  pci_scan_child_bus_extend+0x1e4/0x290
[    7.233469]  pci_scan_root_bus_bridge+0x58/0xd8
[    7.244022]  rockchip_pcie_probe+0x60c/0x750 [pcie_rockchip_host]
[    7.254833]  platform_drv_probe+0x50/0xa0
[    7.265419]  driver_probe_device+0x208/0x2e8
[    7.275995]  __driver_attach+0xd4/0xd8
[    7.286409]  bus_for_each_dev+0x74/0xc8
[    7.296708]  driver_attach+0x20/0x28
[    7.306848]  bus_add_driver+0x1ac/0x218
[    7.316892]  driver_register+0x60/0x110
[    7.326831]  __platform_driver_register+0x40/0x48
[    7.336604]  rockchip_pcie_driver_init+0x20/0x1000 [pcie_rockchip_host]
[    7.336621]  do_one_initcall+0x5c/0x178
[    7.355610]  do_init_module+0x58/0x1b0
[    7.364755]  load_module+0x1e14/0x2210
[    7.373778]  sys_finit_module+0xcc/0xe8
[    7.382700]  __sys_trace_return+0x0/0x4
[    7.391247] Code: 7100129f 54fff921 f94002a0 8b130013 (b9400273) 
[    7.399680] ---[ end trace 706cbd252753b386 ]---
@foundObjects

I'm having exactly the same issue with 3 of the 4 cards I've plugged into my RockPro64. The only card that functions as expected is an Intel I350-T4; the three Mellanox cards I've attempted to use all cause PCIe initialization to fail.

I'll get a serial cable out later and dump a full boot log with the 4.4 kernel and mainline.

@foundObjects

foundObjects commented Aug 11, 2018

Bootlogs below.

Kernel 4.4.132-1075 (ayufan 0.7.9):
Intel I350-T4 -- works perfectly
Mellanox ConnectX-2 MPNA19-XTR -- crashes, trace included
Mellanox ConnectX-2 MHQH29C -- "rockchip-pcie: probe of f8000000.pcie failed with error -110" doesn't crash

Kernel 4.18.0-rc8-1060 (ayufan):
Intel I350-T4 -- working perfectly
Mellanox ConnectX-2 MPNA19-XTR -- fails with call trace, doesn't crash
Mellanox ConnectX-2 MHQH29C -- "rockchip-pcie: probe of f8000000.pcie failed with error -110" doesn't crash

@hopkinskong

Having the same issue, cross ref:
ayufan-rock64/linux-build#254

@luckcolors

Can we please have some updates on this issue?

@foundObjects

Anyone? I'm about ready to sell my RockPro64 and just use x86_64 for my project.

I'm 100% willing to supply any debug information you might need, and I've got about a dozen different PCIe network cards here I can test with.

@rich0

rich0 commented Nov 30, 2018

I'm running into similar issues with an LSI HBA card. It works without issue in a standard x86 motherboard. gen1 training times out when the board is inserted and nothing shows up in lspci. If I plug in a USB3 host card it seems to work fine, so the PCIe slot itself is fine (gen2.1 rockpro64 board).

I built 4.4.154-1124-rockchip-ayufan with PCI_DEBUG enabled and captured this dmesg output with the LSI card installed:
https://pastebin.com/SAZPFpXD

Now, this card is an 8x card in a 4x slot, so as an experiment I ran it through a 1x PCIe mining adapter. It works just fine in an x86 motherboard in this config. When I use this with the rockpro64 I get a new error:
https://pastebin.com/QmyqyNNX

To try something different I rebuilt the kernel, extending the gen1 PCIe training timeout from 500ms to 5s (drivers/pci/host/pcie-rockchip.c line 619). The board boots normally without the LSI card, just giving the usual gen1 timeout message. If I boot it with the LSI card installed directly (no 1x adapter), I now get the same error as with the 1x adapter:
https://pastebin.com/vixEZKr4

So, perhaps it is timing out too quickly or taking too long to train without the 1x adapter, which prevents it from getting to the error. For some reason it trains faster with the 1x adapter.
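
To illustrate why the timeout length matters, here is a minimal Python sketch of a poll-with-timeout loop of the kind described above. It is not the driver code: `wait_for_link_up`, `FakeClock`, the 20 ms poll interval and the 1.2 s training time are all hypothetical, chosen only to show a link that misses a 500 ms window and makes a 5 s one.

```python
import time

def wait_for_link_up(link_is_up, timeout_s=0.5, poll_interval_s=0.02,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll link_is_up() until it returns True or timeout_s has elapsed.

    Returns the elapsed time in seconds on success, or None on timeout.
    """
    start = clock()
    while True:
        if link_is_up():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)

class FakeClock:
    """Deterministic clock so the example runs instantly."""
    def __init__(self):
        self.now = 0.0
    def time(self):
        return self.now
    def sleep(self, seconds):
        self.now += seconds

def simulate(train_time_s, timeout_s):
    """Simulate a card whose link comes up train_time_s after power-on."""
    clk = FakeClock()
    return wait_for_link_up(lambda: clk.now >= train_time_s,
                            timeout_s, 0.02, clk.time, clk.sleep)

# Hypothetical card that needs about 1.2 s to train.
print(simulate(1.2, 0.5))   # times out with the short window
print(simulate(1.2, 5.0))   # succeeds with the long window
```

A longer window only helps if the card eventually trains; a card that never brings the link up fails either way, just later.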

Apologies if this is an unrelated issue - if so I'm happy to create a new one. I'm happy to test anything at this point.

@nuumio

nuumio commented Dec 19, 2018

Any news about this? I'm having the same problem with an LSI 9201 card. If it's of any help, here are a few logs from my Rockpro64.

With 4.4 kernel (ayufan's), serial console log (3 crashes):

With 4.20-rc6 (ayufan's + patch to disable mmc command queueing), serial console log (3 crashes) and dmesg from last attempt:

Edit: Like @rich0 above I tested the card in an x86 setup (Ubuntu 18.04, 64bit). There the card works in both PCIe3 16x and 1x slots and lspci shows (full output pastebin: https://pastebin.com/fuyiB4Dm):
05:00.0 Serial Attached SCSI controller [0107]: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] [1000:0072] (rev 03)

@wombat

wombat commented Jan 8, 2019

I am having the same issue with a Delock PCI Express Card > Mini PCIe adapter connected to a Telit LM960 LTE module: https://pastebin.com/FHMGRgVG

@foundObjects

Has there been any motion on this at all? I'm sitting on what's effectively a useless board for my application (10GbE routing) at the moment since I can't bring up any of the PCIe NICs I've tested (I've tried around 10 different NICs at this point.) The only NICs I've managed to use successfully are Intel i350 and similar boards.

@samex

samex commented Mar 12, 2019

I didn't debug why my rockpro64 is not booting with an LSI 3Ware 9650, but I bet I have the same issue that others are reporting above.

Would be nice to know where we can start to solve that problem?

@nuumio

nuumio commented Apr 3, 2019

Updating my status: Looks like I got my LSI 9201 working, at least with one SSD drive. For some reason the PCIe driver seems to need some delay between training and bus scanning. I built a test kernel with this workaround on top of ayufan's latest 4.4: https://github.com/nuumio/linux-kernel/releases/tag/nuumio-4.4-pcie-scan-sleep-02

The most relevant change is: nuumio@5a65b17 (in branch: https://github.com/nuumio/linux-kernel/commits/nuumio-4.4-pcie-scan-sleep).

Last time I tried this with a bit older kernel I got the controller up but it kept resetting the connection to SSD every few seconds. Now with more patches it seems somewhat stable. I have no idea about the root cause but hopefully this gives ideas where the actual problem is. Curiously the delay needed for this is about the same that was needed earlier for deferring SDIO initialization to get WiFi/BT module and PCIe working at the same time for Rockpro64 (that was finally done so that SDIO driver waits until PCIe is finished).

My current setup:

  • Rockpro64 4GB (power from 12V 6A PSU)
  • LSI 9201
  • WD Blue 250GB SSD (power from ATX PSU)
  • eMMC as boot + rootfs media
  • Pine64 WiFi/BT module connected, no WiFi nor BT configured
  • Network via ethernet
  • OS: ayufan's bionic minimal build + my own kernel

4.4-development branch seems quite active currently. I hope you get this one resolved too :)

@rich0

rich0 commented Apr 6, 2019

Just to comment for the record and the benefit of the many others with this issue, nuumio's patch (which seems to be in line to be released on ayufan) fixes my issue. You just need to set a command line parameter to enable the delay (I haven't worked out the minimum required delay yet).

I was also having power issues which were solved by a 1x mining adapter. Using a 5A power supply is likely to address that problem though nobody has tested the whole thing under heavy load yet.

I'll be doing actual testing of the drives/etc but for now I can get the HBA to show up in lspci. ayufan also enabled LSI HBAs in his kernels.

@RchGrav

RchGrav commented Jun 1, 2019

I was also having power issues which were solved by a 1x mining adapter.

I bet the riser is shorting PCIe pin A1 to B17 to provide "presence" as a 1x card.. See https://imgur.com/a/AJB71Ih

Shorting pin A1 (PRSNT1) to the second presence pin B31 (PRSNT2) would make a PCIe card detect as a 4x.* (The presence pins are a little bit shorter.)

https://electronics.stackexchange.com/questions/201437/pcie-prsnt-signal-connection

Explanation: PCIe Cards short these presence pins to indicate the number of PCIe lanes / BUS width the connection will be using.

(Note: I don't have a RockPro64 to test this on yet, but the same thing would happen on an Intel chipset without these pins jumpered. Here is an example of shorting the pin on the riser https://imgur.com/a/4rl7T5I taken from my plex server...)
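
The presence-pin convention described above can be written down as a small lookup. This is only a sketch of the wiring convention as the comment describes it, not a model of how the RK3399 controller decides link width. The B17 (x1) and B31 (x4) entries come from the comment; the B48 (x8) and B81 (x16) entries are added on the assumption that the standard PCIe edge-connector pinout applies, and the function name is hypothetical.

```python
# PRSNT1# sits on pin A1; a card ties it to one of the PRSNT2# pins.
PRSNT1_PIN = "A1"

# PRSNT2# pin -> connector width it corresponds to.
# B17 and B31 are from the comment above; B48 and B81 are assumed from
# the standard PCIe card edge pinout.
PRSNT2_PIN_TO_LANES = {"B17": 1, "B31": 4, "B48": 8, "B81": 16}

def lanes_for_presence_pin(pin):
    """Return the lane width associated with a PRSNT2# pin, or None if unknown."""
    return PRSNT2_PIN_TO_LANES.get(pin.upper())

print(lanes_for_presence_pin("B17"))  # width signalled by a 1x riser
print(lanes_for_presence_pin("B31"))  # width signalled by a 4x card
```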

@rich0

rich0 commented Jun 2, 2019

The card works fine on a powered 16x riser cable as well, like this one:
https://www.amazon.com/gp/product/B01NAE4O7I/

The only downside to this powered riser is that it seems to drive power back into the rockpro64 such that it remains powered on even after disconnected from the power supply. I don't generally run it this way as I am not certain that drawing current in this way isn't harmful.

So, aside from the likely power issue, the current ayufan kernels address my issues.

@foundObjects

foundObjects commented Jun 4, 2019 via email

@StuartIanNaylor

StuartIanNaylor commented Jun 7, 2019

I see the same with a rockpi4b:

[    1.489159] of_get_named_gpiod_flags: parsed 'gpio' property of node '/vcc3v3-pcie-regulator[0]' - status (0)
[    1.489201] reg-fixed-voltage vcc3v3-pcie-regulator: Looking up vin-supply from device tree
[    1.489236] vcc3v3_pcie: supplied by vcc3v3_sys
[    1.489697] vcc3v3_pcie: at 3300 mV 
[    1.489857] reg-fixed-voltage vcc3v3-pcie-regulator: vcc3v3_pcie supplying 0uV
[    1.623989] phy phy-pcie-phy.9: Looking up phy-supply from device tree
[    1.623999] phy phy-pcie-phy.9: Looking up phy-supply property in node /pcie-phy failed
[    1.625451] rockchip-pcie f8000000.pcie: GPIO lookup for consumer ep
[    1.625461] rockchip-pcie f8000000.pcie: using device tree for GPIO lookup
[    1.625490] of_get_named_gpiod_flags: parsed 'ep-gpios' property of node '/pcie@f8000000[0]' - status (0)
[    1.625725] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply from device tree
[    1.625736] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply property in node /pcie@f8000000 failed
[    1.625748] rockchip-pcie f8000000.pcie: no vpcie3v3 regulator found
[    1.626340] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply from device tree
[    1.626350] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply property in node /pcie@f8000000 failed
[    1.626360] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[    1.626951] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply from device tree
[    1.626960] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply property in node /pcie@f8000000 failed
[    1.626970] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[    2.172391] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
[    2.173158] rockchip-pcie: probe of f8000000.pcie failed with error -110

lspci returns nothing; then other times I will boot and get:

rock@linux:/boot$ lspci
00:00.0 PCI bridge: Fuzhou Rockchip Electronics Co., Ltd Device 0100
01:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)
rock@linux:~$ dmesg | grep pci
[    1.489111] of_get_named_gpiod_flags: parsed 'gpio' property of node '/vcc3v3-pcie-regulator[0]' - status (0)
[    1.489153] reg-fixed-voltage vcc3v3-pcie-regulator: Looking up vin-supply from device tree
[    1.489188] vcc3v3_pcie: supplied by vcc3v3_sys
[    1.489649] vcc3v3_pcie: at 3300 mV
[    1.489808] reg-fixed-voltage vcc3v3-pcie-regulator: vcc3v3_pcie supplying 0uV
[    1.623688] phy phy-pcie-phy.9: Looking up phy-supply from device tree
[    1.623698] phy phy-pcie-phy.9: Looking up phy-supply property in node /pcie-phy failed
[    1.625172] rockchip-pcie f8000000.pcie: GPIO lookup for consumer ep
[    1.625182] rockchip-pcie f8000000.pcie: using device tree for GPIO lookup
[    1.625211] of_get_named_gpiod_flags: parsed 'ep-gpios' property of node '/pcie@f8000000[0]' - status (0)
[    1.625452] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply from device tree
[    1.625462] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply property in node /pcie@f8000000 failed
[    1.625475] rockchip-pcie f8000000.pcie: no vpcie3v3 regulator found
[    1.626067] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply from device tree
[    1.626077] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply property in node /pcie@f8000000 failed
[    1.626087] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[    1.626675] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply from device tree
[    1.626684] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply property in node /pcie@f8000000 failed
[    1.626694] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[    1.810499] PCI host bridge /pcie@f8000000 ranges:
[    1.812367] rockchip-pcie f8000000.pcie: PCI host bridge to bus 0000:00
[    1.813004] pci_bus 0000:00: root bus resource [bus 00-1f]
[    1.813531] pci_bus 0000:00: root bus resource [mem 0xfa000000-0xfbdfffff]
[    1.814190] pci_bus 0000:00: root bus resource [io  0x0000-0xfffff] (bus address [0xfbe00000-0xfbefffff])
[    1.815130] pci 0000:00:00.0: [1d87:0100] type 01 class 0x060400
[    1.815242] pci 0000:00:00.0: supports D1
[    1.815253] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[    1.815619] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.816531] pci_bus 0000:01: busn_res: can not insert [bus 01-ff] under [bus 00-1f] (conflicts with (null) [bus 00-1f])
[    1.816574] pci 0000:01:00.0: [1b21:0612] type 00 class 0x010601
[    1.816628] pci 0000:01:00.0: reg 0x10: initial BAR value 0x00000000 invalid
[    1.817300] pci 0000:01:00.0: reg 0x10: [io  size 0x0008]
[    1.817321] pci 0000:01:00.0: reg 0x14: initial BAR value 0x00000000 invalid
[    1.817993] pci 0000:01:00.0: reg 0x14: [io  size 0x0004]
[    1.818013] pci 0000:01:00.0: reg 0x18: initial BAR value 0x00000000 invalid
[    1.818685] pci 0000:01:00.0: reg 0x18: [io  size 0x0008]
[    1.818705] pci 0000:01:00.0: reg 0x1c: initial BAR value 0x00000000 invalid
[    1.819377] pci 0000:01:00.0: reg 0x1c: [io  size 0x0004]
[    1.819397] pci 0000:01:00.0: reg 0x20: initial BAR value 0x00000000 invalid
[    1.820069] pci 0000:01:00.0: reg 0x20: [io  size 0x0020]
[    1.820090] pci 0000:01:00.0: reg 0x24: [mem 0x00000000-0x000001ff]
[    1.820111] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0000ffff pref]
[    1.828295] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.828341] pci 0000:00:00.0: BAR 8: assigned [mem 0xfa000000-0xfa0fffff]
[    1.828996] pci 0000:01:00.0: BAR 6: assigned [mem 0xfa000000-0xfa00ffff pref]
[    1.829687] pci 0000:01:00.0: BAR 5: assigned [mem 0xfa010000-0xfa0101ff]
[    1.830340] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0020]
[    1.830939] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0020]
[    1.831570] pci 0000:01:00.0: BAR 0: no space for [io  size 0x0008]
[    1.832169] pci 0000:01:00.0: BAR 0: failed to assign [io  size 0x0008]
[    1.832817] pci 0000:01:00.0: BAR 2: no space for [io  size 0x0008]
[    1.833416] pci 0000:01:00.0: BAR 2: failed to assign [io  size 0x0008]
[    1.834048] pci 0000:01:00.0: BAR 1: no space for [io  size 0x0004]
[    1.834646] pci 0000:01:00.0: BAR 1: failed to assign [io  size 0x0004]
[    1.835278] pci 0000:01:00.0: BAR 3: no space for [io  size 0x0004]
[    1.835877] pci 0000:01:00.0: BAR 3: failed to assign [io  size 0x0004]
[    1.836525] pci 0000:00:00.0: PCI bridge to [bus 01]
[    1.837006] pci 0000:00:00.0:   bridge window [mem 0xfa000000-0xfa0fffff]
[    1.837725] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
[    1.838611] pcieport 0000:00:00.0: Signaling PME through PCIe PME interrupt
[    1.839275] pci 0000:01:00.0: Signaling PME through PCIe PME interrupt
[    1.839901] pcie_pme 0000:00:00.0:pcie01: service driver pcie_pme loaded
[    1.840036] aer 0000:00:00.0:pcie02: service driver aer loaded
[    2.001348] ehci-pci: EHCI PCI platform driver

It seems completely spurious: sometimes I hit runs of it working, sometimes runs of it not working.
It's like when two services clash and by chance change their start order at times.

[Edit] I think the bridge has died on me as now with or without I can not get any listing on multiple tries
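
The successful boot log in this comment still contains several "failed to assign" lines, all for I/O BARs. Here is a minimal Python sketch for condensing such lines into a list. The function name `failed_bars` is illustrative, and the regular expression assumes the message format shown in the log above.

```python
import re

# Matches lines such as:
#   pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0020]
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): failed to assign "
    r"\[(?P<kind>io|mem)\s+size (?P<size>0x[0-9a-f]+)"
)

def failed_bars(dmesg_text):
    """Return (device, bar_index, kind, size_in_bytes) for each failed BAR."""
    return [
        (m.group("dev"), int(m.group("bar")), m.group("kind"), int(m.group("size"), 16))
        for m in BAR_FAIL.finditer(dmesg_text)
    ]

sample = (
    "[    1.830340] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0020]\n"
    "[    1.830939] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0020]\n"
    "[    1.831570] pci 0000:01:00.0: BAR 0: no space for [io  size 0x0008]\n"
    "[    1.832169] pci 0000:01:00.0: BAR 0: failed to assign [io  size 0x0008]\n"
)
for entry in failed_bars(sample):
    print(entry)
```

In the log above every failed BAR is of kind `io`, while the memory BARs (5 and 6) were assigned.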

@prusnak

prusnak commented Apr 22, 2020

I see the same issue with the following setup:

[    2.172391] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
[    2.173158] rockchip-pcie: probe of f8000000.pcie failed with error -110

@PhoenixMage

I see a similar issue on a Rock Pi 4 running kernel 5.6.7: PCIe fails to probe (error -110), and hence the NVMe M.2 drive is not detected.

The issue is intermittent as the NVMe drive is detected in linux about 5% of the time.

If I use the u-boot provided by radxa (rather than mainline with rockchip patches), detection becomes 100% reliable, so I wonder whether some form of PCIe initialisation happening in that u-boot resolves the issue. I would still like to see my NVMe working with mainline kernel and mainline u-boot.

@jkoppen-headsfirst

Here too, the boot freezes when a PCIe adapter is present (PCIe x1 to Mini PCIe adapter with Coral Edge TPU).
This is my output without the adapter:

dmesg | grep pci
[ 1.473348] of_get_named_gpiod_flags: parsed 'gpio' property of node '/vcc3v3-pcie-regulator[0]' - status (0)
[ 1.473399] reg-fixed-voltage vcc3v3-pcie-regulator: Looking up vin-supply from device tree
[ 1.473442] vcc3v3_pcie: supplied by dc_12v
[ 1.473509] vcc3v3_pcie: 3300 mV
[ 1.473666] reg-fixed-voltage vcc3v3-pcie-regulator: vcc3v3_pcie supplying 3300000uV
[ 1.892811] phy phy-pcie-phy.5: Looking up phy-supply from device tree
[ 1.892821] phy phy-pcie-phy.5: Looking up phy-supply property in node /pcie-phy failed
[ 1.894568] rockchip-pcie f8000000.pcie: GPIO lookup for consumer ep
[ 1.894578] rockchip-pcie f8000000.pcie: using device tree for GPIO lookup
[ 1.894607] of_get_named_gpiod_flags: parsed 'ep-gpios' property of node '/pcie@f8000000[0]' - status (0)
[ 1.894856] rockchip-pcie f8000000.pcie: Looking up vpcie3v3-supply from device tree
[ 1.894949] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply from device tree
[ 1.894960] rockchip-pcie f8000000.pcie: Looking up vpcie1v8-supply property in node /pcie@f8000000 failed
[ 1.894974] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[ 1.895002] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply from device tree
[ 1.895013] rockchip-pcie f8000000.pcie: Looking up vpcie0v9-supply property in node /pcie@f8000000 failed
[ 1.895025] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[ 1.895049] rockchip-pcie f8000000.pcie: bus-scan-delay-ms in device tree is 1000 ms
[ 1.895084] rockchip-pcie f8000000.pcie: missing "memory-region" property
[ 1.895121] PCI host bridge /pcie@f8000000 ranges:
[ 1.942370] rockchip-pcie f8000000.pcie: invalid power supply
[ 2.442415] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
[ 2.442463] rockchip-pcie f8000000.pcie: deferred probe failed
[ 2.442725] rockchip-pcie: probe of f8000000.pcie failed with error -110
[ 2.737720] ehci-pci: EHCI PCI platform driver
[ 4.379833] vcc3v3_pcie: disabling

@clarkis117

clarkis117 commented Aug 10, 2020

Here's my output from a rockpro64 with a Compex WLE1216VX attached to the PCIe slot:
root@FarmBox:~# dmesg
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.4.52 (builder@buildhost) (gcc version 8.4.0 (OpenWrt GCC 8.4.0 r14101-5d8fded26a)) #0 SMP PREEMPT Sun Aug 9 12:01:52 2020

root@FarmBox:~# dmesg | grep pci
[ 0.286897] vcc3v3_pcie: supplied by vcc12v_dcin
[ 0.314101] rockchip-pcie f8000000.pcie: no vpcie1v8 regulator found
[ 0.314139] rockchip-pcie f8000000.pcie: no vpcie0v9 regulator found
[ 0.868307] rockchip-pcie f8000000.pcie: PCIe link training gen1 timeout!
[ 0.868506] rockchip-pcie: probe of f8000000.pcie failed with error -110
[ 0.990147] ehci-pci: EHCI PCI platform driver

@StuartIanNaylor

@clarkis117
You probably want to check the difference between M.2 and mini PCIe.
The RockPro64 slot is M.2, is it not?

@clarkis117

clarkis117 commented Aug 11, 2020

@StuartIanNaylor it is a mini PCIe card, which I have in a mini PCIe to PCIe adapter. The rockpro64 has a 4x PCIe card slot on its board. It may be a power design issue, as I was able to use an Intel wifi adapter in the same setup, and the Compex card works in an x86 PC with the same adapter. The Compex card has a TDP greater than 10 watts.

@nullr0ute
Contributor

nullr0ute commented Oct 22, 2020

One thing I found while testing is that enabling CONFIG_DEBUG_SHIRQ exposes some issues in the driver. Some details here: https://patchwork.kernel.org/project/linux-rockchip/patch/1502353273-123788-1-git-send-email-shawn.lin@rock-chips.com/

scpcom pushed a commit to scpcom/linux that referenced this issue Dec 1, 2020
[ Upstream commit 8cbcc5e ]

Handle destruction of rules with port destination type to enable
full destruction of flow.

Without this handling of TX rules the deletion of these rules fails.
Dmesg of flow destruction failure:

[  203.714146] mlx5_core 0000:00:0b.0: mlx5_cmd_check:753:(pid 342): SET_FLOW_TABLE_ENTRY(0x936) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x144b7a)
[  210.547387] ------------[ cut here ]------------
[  210.548663] refcount_t: decrement hit 0; leaking memory.
[  210.550651] WARNING: CPU: 4 PID: 342 at lib/refcount.c:31 refcount_warn_saturate+0x5c/0x110
[  210.550654] Modules linked in: mlx5_ib mlx5_core ib_ipoib rdma_ucm rdma_cm iw_cm ib_cm ib_umad ib_uverbs ib_core
[  210.550675] CPU: 4 PID: 342 Comm: test Not tainted 5.8.0-rc2+ #116
[  210.550678] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  210.550680] RIP: 0010:refcount_warn_saturate+0x5c/0x110
[  210.550685] Code: c6 d1 1b 01 00 0f 84 ad 00 00 00 5b 5d c3 80 3d b5 d1 1b 01 00 75 f4 48 c7 c7 20 d1 15 82 c6 05 a5 d1 1b 01 01 e8 a7 eb af ff <0f> 0b eb dd 80 3d 99 d1 1b 01 00 75 d4 48 c7 c7 c0 cf 15 82 c6 05
[  210.550687] RSP: 0018:ffff8881642e77e8 EFLAGS: 00010282
[  210.550691] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
[  210.550694] RDX: 0000000000000027 RSI: 0000000000000004 RDI: ffffed102c85ceef
[  210.550696] RBP: ffff888161720428 R08: ffffffff8124c10e R09: ffffed103243beae
[  210.550698] R10: ffff8881921df56b R11: ffffed103243bead R12: ffff8881841b4180
[  210.550701] R13: ffff888161720428 R14: ffff8881616d0000 R15: ffff888161720380
[  210.550704] FS:  00007fc27f025740(0000) GS:ffff888192000000(0000) knlGS:0000000000000000
[  210.550706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  210.550708] CR2: 0000557e4b41a6a0 CR3: 0000000002415004 CR4: 0000000000360ea0
[  210.550711] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  210.550713] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  210.550715] Call Trace:
[  210.550717]  mlx5_del_flow_rules+0x484/0x490 [mlx5_core]
[  210.550720]  ? mlx5_cmd_set_fte+0xa80/0xa80 [mlx5_core]
[  210.550722]  mlx5_ib_destroy_flow+0x17f/0x280 [mlx5_ib]
[  210.550724]  uverbs_free_flow+0x4c/0x90 [ib_uverbs]
[  210.550726]  destroy_hw_idr_uobject+0x41/0xb0 [ib_uverbs]
[  210.550728]  uverbs_destroy_uobject+0xaa/0x390 [ib_uverbs]
[  210.550731]  __uverbs_cleanup_ufile+0x129/0x1b0 [ib_uverbs]
[  210.550733]  ? uverbs_destroy_uobject+0x390/0x390 [ib_uverbs]
[  210.550735]  uverbs_destroy_ufile_hw+0x78/0x190 [ib_uverbs]
[  210.550737]  ib_uverbs_close+0x36/0x140 [ib_uverbs]
[  210.550739]  __fput+0x181/0x380
[  210.550741]  task_work_run+0x88/0xd0
[  210.550743]  do_exit+0x5f6/0x13b0
[  210.550745]  ? sched_clock_cpu+0x30/0x140
[  210.550747]  ? is_current_pgrp_orphaned+0x70/0x70
[  210.550750]  ? lock_downgrade+0x360/0x360
[  210.550752]  ? mark_held_locks+0x1d/0x90
[  210.550754]  do_group_exit+0x8a/0x140
[  210.550756]  get_signal+0x20a/0xf50
[  210.550758]  do_signal+0x8c/0xbe0
[  210.550760]  ? hrtimer_nanosleep+0x1d8/0x200
[  210.550762]  ? nanosleep_copyout+0x50/0x50
[  210.550764]  ? restore_sigcontext+0x320/0x320
[  210.550766]  ? __hrtimer_init+0xf0/0xf0
[  210.550768]  ? timespec64_add_safe+0x150/0x150
[  210.550770]  ? mark_held_locks+0x1d/0x90
[  210.550772]  ? lockdep_hardirqs_on_prepare+0x14c/0x240
[  210.550774]  __prepare_exit_to_usermode+0x119/0x170
[  210.550776]  do_syscall_64+0x65/0x300
[  210.550778]  ? trace_hardirqs_off+0x10/0x120
[  210.550781]  ? mark_held_locks+0x1d/0x90
[  210.550783]  ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[  210.550785]  ? lockdep_hardirqs_on+0x112/0x190
[  210.550787]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  210.550789] RIP: 0033:0x7fc27f1cd157
[  210.550791] Code: Bad RIP value.
[  210.550793] RSP: 002b:00007ffd4db27ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000023
[  210.550798] RAX: fffffffffffffdfc RBX: ffffffffffffff80 RCX: 00007fc27f1cd157
[  210.550800] RDX: 00007fc27f025740 RSI: 00007ffd4db27eb0 RDI: 00007ffd4db27eb0
[  210.550803] RBP: 0000000000000016 R08: 0000000000000000 R09: 000000000000000e
[  210.550805] R10: 00007ffd4db27dc7 R11: 0000000000000246 R12: 0000000000400c00
[  210.550808] R13: 00007ffd4db285f0 R14: 0000000000000000 R15: 0000000000000000
[  210.550809] irq event stamp: 49399
[  210.550812] hardirqs last  enabled at (49399): [<ffffffff81172d36>] console_unlock+0x556/0x6f0
[  210.550815] hardirqs last disabled at (49398): [<ffffffff81172897>] console_unlock+0xb7/0x6f0
[  210.550818] softirqs last  enabled at (48706): [<ffffffff81e0037b>] __do_softirq+0x37b/0x60c
[  210.550820] softirqs last disabled at (48697): [<ffffffff81c00e2f>] asm_call_on_stack+0xf/0x20
[  210.550822] ---[ end trace ad18c0e6fa846454 ]---
[  210.581862] mlx5_core 0000:00:0c.0: mlx5_destroy_flow_table:2132:(pid 342): Flow table 262150 wasn't destroyed, refcount > 1

Fixes: a7ee18b ("RDMA/mlx5: Allow creating a matcher for a NIC TX flow table")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
@vukitoso

@jkoppen-headsfirst

Here too a boot freeze when a PCIe adapter is present (PCIe x1 to Mini PCIe adapter with Coral Edge TPU).

Hello. Have you solved the problem with the "Coral Edge TPU"?

@nullr0ute

@jkoppen-headsfirst

Here too a boot freeze when a PCIe adapter is present (PCIe x1 to Mini PCIe adapter with Coral Edge TPU).

Hello. Have you solved the problem with the "Coral Edge TPU"?

I've had reports that the Coral Edge TPU does work on Fedora. Note that the Edge TPU PCIe driver, which was in staging upstream, has since been dropped from the upstream kernel, so the testing was done by someone who built their own kernel to bring those drivers back.

@vukitoso

vukitoso commented Nov 30, 2021

@nullr0ute
I don't have a Coral Edge TPU yet; I'm still picking a board.
Have you tried installing the drivers according to the instructions at https://coral.ai/docs/m2/get-started?
The drivers are not in the kernel; they are installed separately.

@daiaji

daiaji commented Jun 7, 2022

https://gitlab.manjaro.org/manjaro-arm/packages/core/linux/-/issues/34
There are some compatibility issues: an SSD that works on a PC causes a kernel freeze on the RK3399.
https://gist.github.com/daiaji/eafa111f4d6dd0079561f16107e555d0
U-Boot also seems to have some faults.

FanX-Tek pushed a commit to FanX-Tek/kernel that referenced this issue Nov 30, 2022
…er invert"

This reverts commit 334791b.

Reason for revert:
The following warning appears on rk3588-evb1-lp4-v10 on suspend:
[   31.636037][  T414] unbalanced disables for vcc3v3_lcd0_n
[   31.636166][  T414] WARNING: CPU: 2 PID: 414 at drivers/regulator/core.c:2768 _regulator_disable+0x2e8/0x2f4
[   31.636191][  T414] Modules linked in: bcmdhd dhd_static_buf
[   31.636256][  T414] CPU: 2 PID: 414 Comm: composer@2.1-se Not tainted 5.10.110 #116
[   31.636279][  T414] Hardware name: Rockchip RK3588 EVB1 LP4 V10 Board (DT)
[   31.636309][  T414] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[   31.636338][  T414] pc : _regulator_disable+0x2e8/0x2f4
[   31.636366][  T414] lr : _regulator_disable+0x2e8/0x2f4
...
[   31.636950][  T414] Call trace:
[   31.636980][  T414]  _regulator_disable+0x2e8/0x2f4
[   31.637009][  T414]  regulator_disable+0x40/0x84
[   31.637036][  T414]  panel_simple_unprepare+0x78/0xa4
[   31.637064][  T414]  drm_panel_unprepare+0x28/0x48
[   31.637094][  T414]  dw_mipi_dsi2_encoder_disable+0x70/0xbc
[   31.637123][  T414]  drm_atomic_helper_commit_modeset_disables+0x174/0x4d0
[   31.637154][  T414]  rockchip_drm_atomic_helper_commit_tail_rpm+0x44/0x184
[   31.637180][  T414]  commit_tail+0x110/0x200
[   31.637209][  T414]  drm_atomic_helper_commit+0x1f0/0x210
[   31.637238][  T414]  drm_atomic_commit+0x50/0x64
[   31.637268][  T414]  drm_mode_atomic_ioctl+0x620/0x744
[   31.637298][  T414]  drm_ioctl+0x24c/0x3b8
[   31.637328][  T414]  __arm64_sys_ioctl+0x94/0xd0
[   31.637359][  T414]  el0_svc_common+0xc0/0x23c
[   31.637388][  T414]  do_el0_svc+0x28/0x88
[   31.637417][  T414]  el0_svc+0x14/0x24
[   31.637446][  T414]  el0_sync_handler+0x88/0xec
[   31.637474][  T414]  el0_sync+0x1a8/0x1c0

Signed-off-by: Tao Huang <huangtao@rock-chips.com>
Change-Id: Id27946e0ef3a6c320214c961b8e9b02978a15f6b
stvhay referenced this issue in stvhay/kernel Feb 24, 2023
scpcom pushed a commit to scpcom/linux that referenced this issue Mar 21, 2023
[ Upstream commit fb6df43 ]

Lockdep reports that acpi_nfit_shutdown() may deadlock against an
opportune acpi_nfit_scrub(). acpi_nfit_scrub() is run from inside a
'work' and therefore has already acquired workqueue-internal locks. It
also acquires acpi_desc->init_mutex. acpi_nfit_shutdown() first
acquires init_mutex, and was subsequently attempting to cancel any
pending workqueue items. This reversed locking order causes a potential
deadlock:

    ======================================================
    WARNING: possible circular locking dependency detected
    6.2.0-rc3 #116 Tainted: G           O     N
    ------------------------------------------------------
    libndctl/1958 is trying to acquire lock:
    ffff888129b461c0 ((work_completion)(&(&acpi_desc->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x43/0x450

    but task is already holding lock:
    ffff888129b460e8 (&acpi_desc->init_mutex){+.+.}-{3:3}, at: acpi_nfit_shutdown+0x87/0xd0 [nfit]

    which lock already depends on the new lock.

    ...

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&acpi_desc->init_mutex);
                                  lock((work_completion)(&(&acpi_desc->dwork)->work));
                                  lock(&acpi_desc->init_mutex);
     lock((work_completion)(&(&acpi_desc->dwork)->work));

    *** DEADLOCK ***

Since the workqueue manipulation is protected by its own internal locking,
the cancellation of pending work doesn't need to be done under
acpi_desc->init_mutex. Move cancel_delayed_work_sync() outside the
init_mutex to fix the deadlock. Any work that starts after
acpi_nfit_shutdown() drops the lock will see ARS_CANCEL, and the
cancel_delayed_work_sync() will safely flush it out.

Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Link: https://lore.kernel.org/r/20230112-acpi_nfit_lockdep-v1-1-660be4dd10be@intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
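
The lock-ordering fix described in the commit message above can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions, not the kernel code: the `Desc` class, the `cancelled` flag (standing in for ARS_CANCEL), and the use of a Python thread with `join()` in place of a workqueue item and `cancel_delayed_work_sync()` are all hypothetical stand-ins.

```python
import threading
import time


class Desc:
    """Hypothetical stand-in for acpi_desc: one mutex, one cancel flag, one worker."""

    def __init__(self):
        self.init_mutex = threading.Lock()
        self.cancelled = False  # plays the role of ARS_CANCEL
        self.scrub_runs = 0
        self.worker = threading.Thread(target=self._scrub_loop)

    def _scrub_loop(self):
        # Like acpi_nfit_scrub(): runs as "work" and takes init_mutex itself.
        while True:
            with self.init_mutex:
                if self.cancelled:
                    return
                self.scrub_runs += 1
            time.sleep(0.001)

    def shutdown(self, timeout=2.0):
        # Set the cancel flag while holding the mutex...
        with self.init_mutex:
            self.cancelled = True
        # ...but wait for the worker only after dropping it. Waiting while
        # still holding init_mutex could block forever if the worker is
        # itself waiting for that same mutex.
        self.worker.join(timeout)
        return not self.worker.is_alive()


desc = Desc()
desc.worker.start()
time.sleep(0.01)
stopped = desc.shutdown()
print("worker stopped cleanly:", stopped)
```

If `join()` were moved inside the `with self.init_mutex:` block in `shutdown()`, the worker could be stuck waiting for the mutex while the shutdown path waits for the worker, which mirrors the circular dependency in the lockdep report.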