Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raspberry Pi 5 NVMe using PCIe not working with kernel 6.6.x #5873

Open
teoteo opened this issue Jan 20, 2024 · 24 comments
Open

Raspberry Pi 5 NVMe using PCIe not working with kernel 6.6.x #5873

teoteo opened this issue Jan 20, 2024 · 24 comments

Comments

@teoteo
Copy link

teoteo commented Jan 20, 2024

Describe the bug

Connecting an NVMe SSD using the Pimoroni "NVMe Base for Raspberry Pi 5" on Arch Linux ARM with kernel versions prior to 6.6.x not on versions 6.6.9 and higher.

Steps to reproduce the behaviour

Using Arch Linux (in my case AstroArch) with connected a PCIe extension board with NVMe SSD connected.

  1. Boot from release 1.7 with kernel 6.1.61-1-rpi-ARCH
  2. with command lsblk I can see the SSD
  3. update the system with update-astroarch with kernel 6.6.12-1-rpi and reboot
  4. with command lsblk I can’t see the SSD
  5. downgrade the kernel with sudo pacman -U https://alaa.ad24.cz/packages/l/linux-rpi/linux-rpi-6.1.70-1-aarch64.pkg.tar.xz and reboot
  6. with command lsblk I can see the SSD

Device (s)

Raspberry Pi 5

System

OS: Arch Linux ARM aarch64
Host: Raspberry Pi 5 Model B Rev 1.0
vcgencmd not available
Kernel: Linux astroarch 6.6.12-1-rpi #1 SMP PREEMPT Wed Jan 17 19:05:39 MST 2024 aarch64 GNU/Linux

Logs

No response

Additional context

I’ve tried with different kernels and with kernels after 6.6.x (linux-rpi-16k too) can’t see the SSD.
Asking on Arch Linux ARM forum don’t seems to be a modules issue.

@JamesH65
Copy link
Contributor

Is the drive you are using in this compatibility list? https://shop.pimoroni.com/products/nvme-base?variant=41219587178579

@teoteo
Copy link
Author

teoteo commented Jan 20, 2024

It isn’t listed: it’s a Samsung 970 EVO 1TB. It woks on Raspberry Pi OS:
2024/01/05 15:57:40
Copyright (c) 2012 Broadcom
version 30cc5f37 (release) (embedded)
Raspberry Pi reference 2023-12-05

Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.1.0-rpi7-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux
pibenchmarks 970 EVO 2024-01-20 alle 10 29 35

but if it can help, I can try with a listed Crucial P3 4TB.

@popcornmix
Copy link
Collaborator

It isn’t listed: it’s a Samsung 970 EVO 1TB. It woks on Raspberry Pi OS:

On RPiOS can you run update to 6.6 kernel and confirm you see the same failure as you do on Arch?

@teoteo
Copy link
Author

teoteo commented Jan 20, 2024

On RPiOS can you run update to 6.6 kernel and confirm you see the same failure as you do on Arch?

Updated and it works:

matteo@raspberrypi:~ $ vcgencmd version; cat /etc/rpi-issue; uname -a
2024/01/15 19:02:28 
Copyright (c) 2012 Broadcom
version f0aa0715 (release) (embedded)
Raspberry Pi reference 2023-12-05
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.6.12-v8-16k+ #1723 SMP PREEMPT Thu Jan 18 13:00:39 GMT 2024 aarch64 GNU/Linux
matteo@raspberrypi:~ $ lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0  93.3M  1 loop /snap/core/16204
sda           8:0    1     0B  0 disk 
mmcblk0     179:0    0  58.9G  0 disk 
├─mmcblk0p1 179:1    0   512M  0 part /boot/firmware
└─mmcblk0p2 179:2    0  58.4G  0 part /
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part 
└─nvme0n1p2 259:2    0 117.5G  0 part 

Do you have suggestions for Arch Linux ARM guys to better investigate the issue?

@graysky2
Copy link

graysky2 commented Jan 21, 2024

Could the issue be that other package is not providing something needed? ArchARM ships RPiOS bcm2712_defconfig as our baseline config. We add a few modules and settings but the basis set is there. I do not have NVMe hardware for testing on my RPi5/ArchARM and have no experience using these devices.

  1. Let's rule out that something in the Arch ARM kernel config modifications is causing this.
  • I built a version of the linux-rpi-16k kernel package using the .config generated by make bcm2712_defconfig (for reference full .config) without any modifications which should be the same one as RPiOS is using. Download the kernel package from here.
  • Does booting into that kernel allow the device to work?

@teoteo
Copy link
Author

teoteo commented Jan 22, 2024

  • Does booting into that kernel allow the device to work?

Yes, with that kernel the device works:

~ » uname -a; lsblk                                                                              astronaut@astroarch
Linux astroarch 6.6.12-2-rpi-16k #1 SMP PREEMPT Sun Jan 21 11:48:27 EST 2024 aarch64 GNU/Linux
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0     179:0    0  29.1G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part /boot
└─mmcblk0p2 179:2    0  28.9G  0 part /
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part 
└─nvme0n1p2 259:2    0 117.5G  0 part 

@seamusdemora
Copy link

@teoteo : So glad you've opened this issue... my NVMe hardware (also fm Pimoroni) is to be delivered this week!

@graysky2
Copy link

graysky2 commented Jan 22, 2024

Yes, with that kernel the device works:

OK, so something in the options we add to bcm2712_defconfig is causing the problem. The added options are here. At first glance, I did not see anything obviously to blame. @popcornmix @pelwell - do you have any thoughts?

The modified .config is generated like this:

    make bcm2712_defconfig
    cat archarm.diffconfig >> .config
    make oldconfig

For reference, the following commit illustrates the fully expanded .config highlighting the difference between RPiOS's bcm2712_defconfig and the one that is apparently to blame for the reported issue: graysky2/PKGBUILDs@4846daa

Again, I am not seeing any smoking gun. Grateful for any thought you may have.

@P33M
Copy link
Contributor

P33M commented Jan 23, 2024

I've just hit this with rpi-update - compare pull/5882 with 6.1.74 -> bootloader probes NVMe, Linux fails to establish a link. On 6.1.74, Linux probes fine.

@P33M
Copy link
Contributor

P33M commented Jan 23, 2024

Reverting this hunk which changed between 6.1 and 6.6 fixes things for me.

@@ -393,42 +395,34 @@ static u32 brcm_pcie_mdio_form_pkt(int port, int regad, int cmd)
 /* negative return value indicates error */
 static int brcm_pcie_mdio_read(void __iomem *base, u8 port, u8 regad, u32 *val)
 {
-       int tries;
        u32 data;
+       int err;

        writel(brcm_pcie_mdio_form_pkt(port, regad, MDIO_CMD_READ),
                   base + PCIE_RC_DL_MDIO_ADDR);
        readl(base + PCIE_RC_DL_MDIO_ADDR);
-
-       data = readl(base + PCIE_RC_DL_MDIO_RD_DATA);
-       for (tries = 0; !MDIO_RD_DONE(data) && tries < 10; tries++) {
-               udelay(10);
-               data = readl(base + PCIE_RC_DL_MDIO_RD_DATA);
-       }
-
+       err = readl_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_RD_DATA, data,
+                                       MDIO_RD_DONE(data), 10, 100);
        *val = FIELD_GET(MDIO_DATA_MASK, data);
-       return MDIO_RD_DONE(data) ? 0 : -EIO;
+
+       return err;
 }

 /* negative return value indicates error */
 static int brcm_pcie_mdio_write(void __iomem *base, u8 port,
                                u8 regad, u16 wrdata)
 {
-       int tries;
        u32 data;
+       int err;

        writel(brcm_pcie_mdio_form_pkt(port, regad, MDIO_CMD_WRITE),
                   base + PCIE_RC_DL_MDIO_ADDR);
        readl(base + PCIE_RC_DL_MDIO_ADDR);
        writel(MDIO_DATA_DONE_MASK | wrdata, base + PCIE_RC_DL_MDIO_WR_DATA);

-       data = readl(base + PCIE_RC_DL_MDIO_WR_DATA);
-       for (tries = 0; !MDIO_WT_DONE(data) && tries < 10; tries++) {
-               udelay(10);
-               data = readl(base + PCIE_RC_DL_MDIO_WR_DATA);
-       }
-
-       return MDIO_WT_DONE(data) ? 0 : -EIO;
+       err = readw_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
+                                       MDIO_WT_DONE(data), 10, 100);
+       return err;
 }

So, MDIO prodding broke - but why does RP1 still work?

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

i.e. this commit: ca5dcc7

@popcornmix
Copy link
Collaborator

popcornmix commented Jan 23, 2024

Does the switch from -EIO to -ETIMEDOUT matter?
Or is it the change in timeout duration? (I feel the replacement will timeout a fraction earlier)

pelwell added a commit that referenced this issue Jan 23, 2024
…mic()"

This reverts commit ca5dcc7.

Reverting because it breaks NVME usage.

See: #5873

Signed-off-by: Phil Elwell <phil@raspberrypi.com>
@P33M
Copy link
Contributor

P33M commented Jan 23, 2024

I think the bug is doing readw into a u32 then testing the top bit for "done", and exiting if it's 0. This means writes done in rapid succession will collide.

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

Yes, the readw looks like a typo/thinko (w = word). So you just want:

-       err = readw_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
+       err = readl_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
                                        MDIO_WT_DONE(data), 10, 100);

@P33M
Copy link
Contributor

P33M commented Jan 23, 2024

Yep that fixes it.

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

Just push your patch.

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

Include:

Fixes: ca5dcc76314d ("PCI: brcmstb: Replace status loops with read_poll_timeout_atomic()")

@graysky2
Copy link

graysky2 commented Jan 23, 2024

Did I misread something? @teoteo - if you booted with the unmodified 6.6.12-1 (built from 7df1d3b with ArchARM's config) the bug was triggered yet if you booted with the kernel I prepared using that same commit yet RPiOS's config not ArchARM's config, you did not see the bug. Is this correct?

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

That would have been on a 6.1 kernel, which hasn't received the offending patch.

@pelwell
Copy link
Contributor

pelwell commented Jan 23, 2024

But then:

Updated and it works:

matteo@raspberrypi:~ $ vcgencmd version; cat /etc/rpi-issue; uname -a
2024/01/15 19:02:28 
Copyright (c) 2012 Broadcom
version f0aa0715 (release) (embedded)
Raspberry Pi reference 2023-12-05
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.6.12-v8-16k+ #1723 SMP PREEMPT Thu Jan 18 13:00:39 GMT 2024 aarch64 GNU/Linux

I can't explain that.

@graysky2
Copy link

graysky2 commented Jan 23, 2024

@pelwell - if you see here, @teoteo is using the 6.6.12 kernel built from that commit....

In any case, I am glad you guys got this fixed. I updated ArchARM's kernels/should build and hit a mirror soon.

@MattBlack85
Copy link

I would like to thank anyone here. Really appreciate this has been resolved thanks to collective efforts!

@teoteo
Copy link
Author

teoteo commented Jan 25, 2024

I did some testing and it works on my system. Thank you all.

~ » uname -a; lsblk                                                                                                                                        
Linux astroarch 6.6.13-2-rpi #1 SMP PREEMPT Tue Jan 23 13:37:27 MST 2024 aarch64 GNU/Linux
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0     179:0    0 116.2G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part 
└─mmcblk0p2 179:2    0 115.9G  0 part 
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part /boot
└─nvme0n1p2 259:2    0 931.3G  0 part /

@malteneuss
Copy link

Is this fix something that can help with the other NVMe-not-working cases for the normal RPI Debian?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants