PCI Express regression on CM4 / CM4 IO Board on stable_20231004 (6.1.54, Bookworm) #5659
Comments
Hello again there, Small update:
kernel-panic-1.txt I now seem to have a working configuration with the workarounds. I'll keep you informed if anything else comes up, and I remain available for any additional tests or information you may need. |
Hello, it's me again. I tried some other PCIe cards to check everything: there is no issue with the VL805 PCIe card, but I found out that this regression also affects the following PCIe card:
Even having bootloader able to use the card for booting, kernel won't boot or tell anything unless Serial output without
But as told above, as soon as the option is added, boot is occurring fine (and the PCIe card is working).
Note about ASM1166 firmware: |
Hi, I have applied the same workaround(s) as you have here. I am still getting the bus errors; however, as of right now I have not had any kernel panics, which I should have had by now, and it is correcting about 95% of the errors. I'll be honest, I am in way over my head at this point, but I am wondering: did you experience this? Can I just go with the ignorance-is-bliss approach here if it isn't crashing? This problem has been driving me crazy all week, and at this point I'm ready to just ignore the bus errors if it appears to be writing data correctly. It may be worth mentioning that I am using a 6-port card with an ASM1166 chip as well as a USB SATA card with a VL805 (?) chip, running through a PCIe switch. I believe these are functioning perfectly, as I have no problems without the SATA card. EDIT: I'm a fool, I didn't add the fix to the first line in |
Hi @aimbotbob, Fortunately, with the workaround parameters in However I still had a startup crash once (on a reboot). By chance, because I was paranoid enough I still had the UART reader attached and got the kernel log 2023-10-24-startup-crash-despite-workaround.txt. So I'm still hoping for something that would solve this PCIe issues anyway; because for now, I had to disable auto reboot into my unattended-upgrades config file just in case, so that I only reboot the NAS while I'm physically able to do a power cycle in case the reboot goes wrong 😩 considering releasing such a device is kinda unthinkable at this point... 😞 |
@julienrobin28 If it makes you feel any better, mine locked up last night, and there is nothing in the logs to indicate what caused it (apart from lsyncd starting several processes). I have been using lsyncd to back up my files, and I believe this may exacerbate the issue, so I have limited it to 1 process at a low bandwidth. I have another SATA card knocking about with a different chipset (which one, I couldn't tell you off the top of my head), but I have had trouble getting it powered by the Pi. I will be ordering some powered PCIe risers today that should (fingers crossed) fit in the enclosure I am using. Would you like me to keep you posted on whether things are more stable with another chipset? |
Sorry to drag this up, I'm getting the same with the Waveshare CM4 PCIe SATA adapter I am using. Is the workaround simply to add the two entries to I can navigate around tech, but I have never done something like recompiling a kernel. Sorry to drag this old thread up from the dust! |
@AhnFire No problem, you're welcome, this thread is still open (I'm even still hoping that something ends up by solving this issue!) For the related workaround, yes you can apply it by adding these parameters to the By the way, in case you need it some day: Some interesting and very complete information is available here in case you want to do it some day |
I'd almost say the opposite - build your first kernel before you need to, using one of the Pi defconfig files and making no changes, just to get it out of the way and so you know that if it doesn't work that it is less likely to be your fault. Once you've got that working, feel free to change stuff. |
Great, thank you both! :) Looking forward to trying out the workaround for now so I can get my little mini-NAS project going. |
Update: After correcting a formatting error (I was using comma-separated after reading the existing entries incorrectly) and also noting there was a 3rd parameter, I had about 10 reboots that went smooth. Including 1 power down and cold boot. This is my cmdline.txt entry, if this helps anyone: With the first couple of cold boots, it still went into kernel panic, but then it seemed to stabilize, I don't understand why. I have a suspicion about doing Question, are others finding things more stable with Bullseye over Bookworm? Before I saw this workaround, I tried Bullseye, but I was still getting kernel panics. I have not tried since re-flashing my emmc with Bookworm and using this workaround. Does this workaround reduce the performance of the PCI port, does anyone know? |
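The formatting pitfall described above (comma-separated instead of space-separated entries) can be avoided by editing the line programmatically. Below is a minimal sketch, assuming cmdline.txt holds a single line of space-separated parameters; the function name `add_kernel_params` and the sample line are hypothetical, and only the two parameters named in this issue are used (the third parameter mentioned above is not shown in this thread, so it is left out).

```python
def add_kernel_params(cmdline: str, params: list[str]) -> str:
    """Return the cmdline.txt content with each parameter present exactly once."""
    # cmdline.txt is a single line: parameters are separated by spaces, not commas.
    tokens = cmdline.split()
    for param in params:
        if param not in tokens:
            tokens.append(param)
    # Everything stays on one line, terminated by a single newline.
    return " ".join(tokens) + "\n"


# Hypothetical existing content, for illustration only.
existing = "console=serial0,115200 console=tty1 rootwait\n"
print(add_kernel_params(existing, ["pcie_aspm=off", "pcie_ports=compat"]), end="")
```

Running the function twice on the same content leaves the line unchanged, so it is safe to re-apply after an OS update rewrites nothing. Writing the result back to /boot/firmware/cmdline.txt is left out on purpose; back up the file before changing it.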
Reading a little about the parameters we had to set in order to get a stable boot, I think I will lose a lot of the features that this board is supposed to give. Onboard SATA host controller (AHCI) with upstream PCle Gen3 x1 and downstream four SATA Gen3 ports. It's a low latency, low cost and low power AHCI controller. With four SATA ports and cascaded port multipliers, it can enable users to build up various high-speed IO systems, including server, high-capacity system storage or surveillance platforms Supports 1-ch PCI Express https://www.waveshare.com/pcie-to-sata-4p.htm I wonder if this is relevant to my card (using the same chipset). This points to a general Linux MR. |
I'm experiencing the same issue with an ASMedia ASM1064 SATA adapter: I was unsure whether this is related to upstream kernel bug #217276. I tried applying Jim Quinlan's patchset (v9), but this did not reduce the probability of a kernel panic. |
Greetings,
with the cmdline.txt options mentioned above, I only get varying success, instead of always the same message at the same point: here is the kernel config I created with |
Does adding "pcie_aspm=off" in cmdline.txt resolve the issue? |
Adding |
Sorry for re-opening this old thread, however thanks for this it really helps. Using the workaround I could run PiOS 12 with the ASMedia SATA PCIe card installed. However, when I installed OMV7 (Open Media Vault https://www.openmediavault.org/) the PCIe card stopped working. The changes to cmdline.txt are still in the file but it seems like it is not being used. I am not sure if OMV7 is booting the system differently. I have asked about it on the OMV forum but no response so far. So I thought I might compile with kernel with the changes in it so I did not have to use cmdline.txt because it does not seem to work for OMV7. What did you do when you compiled the kernel? Did you use "menuconfig" or edit the".config" file manually? Did you just comment out the following lines? CONFIG_PCIEPORTBUS=y Thanks! |
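One way to check whether the parameters in cmdline.txt actually reached the running kernel is to compare them with /proc/cmdline, which exposes the command line the kernel was booted with. The following is a minimal sketch; the helper name `missing_params` is hypothetical, and it says nothing about how OMV7 boots the system.

```python
from pathlib import Path


def missing_params(running_cmdline: str, expected: list[str]) -> list[str]:
    """Return the expected parameters that are absent from the kernel command line."""
    tokens = running_cmdline.split()
    return [param for param in expected if param not in tokens]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        # An empty list means both workaround parameters were passed to the kernel.
        print(missing_params(proc_cmdline.read_text(),
                             ["pcie_aspm=off", "pcie_ports=compat"]))
```

If the list is not empty even though cmdline.txt contains the parameters, the file being edited is probably not the one used at boot.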
Hi @CyberLeader3000 and sorry for the delay, I took a look at my files about this issue, and I collected some of the related information. Into the Raspberry Pi fork of the Linux kernel, the PCI related default options were changed from bcm2711_defconfig-from-rpi-6.1.21.txt Taking a look with "Meld" shows the following differences about them: However, after the build configuration step, those files are used as basis to create a Meld shows the following differences about associated Depending on the build options of the kernel you are using, check about options like If Just in case: Beware
|
Hi julienrobin28, Debugging is a little bit hard for me because I am running headless with the "lite" version of PiOS. I need to remove the HAT and connect a UART to USB to see if I can get boot information. I tried building a couple kernels with different configurations, however the PCIe did not work in them. :-( I decided to start again with a new PiOS Bookworm image and then updated it. It seemed to work with the SATA card installed so I installed OMV7 and it was still booting and working with the SATA card installed. I hot plugged an SSD and it worked as well. This was looking good. I then re-booted the system and it would not boot. So it looks like it only boots if there are no discs connected to the SATA connectors. :-( It is not really practical to always hot-plug the drives after a re-boot. My root filesystem is on the SD card just to make it easier to change and backup. I have been using ASM 1064 SATA cards so I bought an ASM1166 like you use to see if it makes a difference. I tried the ASM 1166 and it did not work for me. So my plan: I will also see if I can get any input from the Pi forums. Thanks! |
I have been doing a bit more debugging on this problem. I tried several different configurations and custom kernels but nothing worked (or at least not consistently). It is hard to know what is happening because the NAS runs headless, so connect a UART to USB adapter and enable the port by adding enable_uart=1 and uart_2ndstage=1 to config.txt. It seems to work consistently with both changes. If I comment out the debug port line in config.txt the system gets a kernel panic. The system boots and works even when the debug hardware is not connected. There seems to be 2 problems. It looks like there is a change to the PCI clkreq# modes configuration and the cmdline.txt fixes this problem. There might be a timing/synchronization problem and the writing to the debug port seems to make this work. It could be that a sync. of some sort needs to be added to the boot procedure. While my system appears to be working at the moment, I am not sure if this is a stable long term solution. Interesting information: |
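Since the behaviour described above seems to depend on the two debug settings being active, it can help to check them in config.txt before each test boot. This is a minimal sketch under simplifying assumptions: the helper name `read_settings` is hypothetical, lines starting with `#` are treated as comments, and conditional section filters such as `[all]` are ignored rather than evaluated.

```python
def read_settings(config_text: str) -> dict[str, str]:
    """Parse key=value lines from config.txt content (section filters are ignored)."""
    settings = {}
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments and section headers such as [all].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


# Illustrative content only: the second option is commented out here.
sample = "enable_uart=1\n#uart_2ndstage=1\n"
settings = read_settings(sample)
for key in ("enable_uart", "uart_2ndstage"):
    print(key, "enabled" if settings.get(key) == "1" else "not enabled")
```

With the sample above, `enable_uart` is reported as enabled and `uart_2ndstage` as not enabled, which corresponds to the "debug port line commented out" case.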
So I have just come back to this thread as frankly I got frustrated with the Pi / SATA issue and decided to put it into the fuck-it bucket for a little while, it is nice to discover that the problem has only gotten worse in my absence and my very limited Linux knowledge has only regressed in that time. So today I have spent several hours with multiple different SATA cards (all ASM based besides one (VL805)), PCIE risers and Kernel configs and not a single bloody combination of any can even get my Pi to detect a SATA card at this point - looks like I have got similar problems to @CyberLeader3000 . I'll spend some more time tinkering over the next few days but I am already close to my wits end with this thing already, if no one hears from me it is because I got pissed off and switched to some hardware that isn't going to make me jump through hoops just to get SATA running. |
Hi @aimbotbob, I believe the PCIe bus of the BCM2711 simply turned out to be "not good enough" for many devices other than the VL805 that is embedded on the Pi 4 boards. The kernel features newly enabled by default just made this more visible, rather than being a "regression" in the literal sense. I don't know whose fault it is, but I don't believe it was either intentional or planned... There is someone (Jeff Geerling, if I'm right) who tries to list every PCIe device that has been tested on Raspberry Pi boards, because a lot of them aren't working properly. As for me, I switched to a Pi 5 board + 52Pi P02 PCIe X1 adapter board.
About the ASM1166, the Pi 5 and the 52Pi board, it says that it can go to PCIe Gen 3 but in reality, no. It's unstable, it should be kept to Gen 2 (which is perfectly stable). Remaining problems:
About how these problems are frustrating: It makes me realize how much work is provided for every devices around us (TV boards, routers, connected gadgets etc), as every problem we gets while crafting our toys are inevitably encountered by many other people on many others project and devices. While I want to see this as "the bare minimum", in reality, succeeding in being uncompromising on anything that is not working fine is very huge work, even when targeting a fixed/same usage for every customer with no updates in the future. And I'm not even talking about feedback and unexpected issues appearing once the device is installed and serving in real life 😱 if you add the updates, unavoidable changes between OS versions, firmware, drivers, libraries, etc, and it's literally an endless work. Using others devices, I quickly realized that still having support with latest OS and kernels, and even improvement (most of the time 😅) for boards that are more than 10 years old is almost impossible outside of Raspberry Pi Foundation. I still have some Raspberry Pi 1 working fine with both IMX219 cameras and WiFi dongles, including a completely rewritten camera stack (that I sometimes hated of course, but at the end with a lot of work I got everything eventually working 100% fine). I even got unable to get new WiFi cards (Intel BE200) to work on PC AM4 platform because of an incompatibility between Intel BE200 and some AGESA version, and both AMD and Intel are being completely quiet for months about it, so I'm forced to note that things like this are existing even in the x86_64 PC world. My conclusion is: Strangely enough, despite all of this endless/exhausting stream of problem encountered with Pi devices, at the end it remains really respectable devices, while not succeeding in being as perfect as expected. And yes, we are still going be angry against problems in the future 😁 👍 we should be prepared to keep up with this, as many of them may be solved and/or worked around. 
Maybe more work should be done globally on making existing technologies more robust and reliable instead of creating so many new technologies on top of them (but that is probably out of the scope of this issue anyway, and maybe out of scope for a single company). |
Hi julienrobin28, Thanks for more work on this topic. I also posted this issue on the Raspberry Pi forums (https://forums.raspberrypi.com/viewtopic.php?t=375290&e=1&view=unread#unread) and got a reply from a forum moderator. He replied but has not looked into it yet. I tried a Pi5 + ASM1166 + 52Pi PCI board and it did not work but I did not modify config.txt. Do you know if there is a similar dtoverlay for the CM4? I will look into dtoverlay. Are you using more than 1 drive with Pi5 + ASM1166 + 52Pi? With the original changes to cmdline.txt, I can get CM4 + ASM1064 working with 1 SSD drive. If I add more drives it stops working. :-( I have now soldered a debug console connector to the HAT board so I can see what is happening when it boots. I have not had a chance to look at it in detail but it looks like when it boots it does not see the HDD and must scan for it later. When it does not boot it seems to find the HDD while first boot. I need to do more investigation. I think it looks like Jeff Geerling's PCI page has space for Pi5 information but none has been added yet. I guess he really also needs a Pi 4 Bookworm column. |
@CyberLeader3000 - I have things split between CM4/Pi 4 and Pi 5 (and presumably CM5 at some point), just because the physical implementation differs between BCM2711 and BCM2712. Trying to add a matrix of all distros + versions would make it a bit heavy, so I'll keep it divided by hardware only. The GitHub issues attached to particular devices has discussion about any quirks or problems that crop up with later OS revisions, there are already a few issues like the Intel AX201/AX200 WiFi adapters where people have noticed some PCIe issues cropping up with later Pi OS releases which require workarounds (which weren't a problem in Pi OS 11). |
I re-plugged my CM4 + CM4 IO Board and tested again about the ASM1166 PCIe SATA card. This allows me to confirm the issue is still existing, and past reported observations about this issue are still valid as of today (2024/09/13). Nothing changed in my case; but I'm detailing everything here in case it turns to be useful to spot differences with your configuration.
The output of my
By the way,
The default With the default command line I still get the kernel panics, and everything goes back to "fine" when adding I can also confirm I can boot it with several SATA devices already connected to it (I was already using it with 4 devices in the past, 3 x SATA 12 TB HDD + 1 SATA SSD - this is still what I'm using today on the Pi 5). SATA Hot-plug also works. Also, moving the rootfs out of the SD card (into a SATA SSD connected through the ASM1166 PCIe SATA adapter card) still works fine in my case. In order to achieve this, from another computer I did the followings:
Since Raspberry Pi OS 12 if I'm right, the boot partition (which is still on the SD Card in my case) contains an "initramfs" which contains the required kernel modules to access SATA drives behind PCIe adapters. Note about ASM1166 firmware: Note about dtoverlays: Hoping this may help! |
Hi @julienrobin28, Thanks, this really helps. I have done some more investigation as well. I bought an ASM1166 card and a 52Pi board for my RPi 5 so I could test your setup too. For the CM4 I used a 3.3 V UART to USB adapter and the debug changes to config.txt.
Raspberry Pi 5 + 52Pi PCIe board, freshly updated PiOS on the SD card with the dtoverlay added to config.txt, 2 drives connected (2.5" Ironwolf SSD and 2.5" Samsung HDD): -ASM1064, worked and ran OMV7
Raspberry Pi CM4 + carrier board, freshly updated PiOS on the SD card with the added cmdline.txt options, 2 drives connected (2.5" Ironwolf and 2.5" Samsung HDD): -ASM1064, worked and ran OMV7 (most of the time?)
When it failed to boot there were 2 failure modes. In the first, it just hangs:
The second failure mode is: This error seems to happen repeatedly and it takes several tries to get it to boot again. In general, my CM4 OMV7 with the ASM1064 is running ok, however, I am worried that after a power failure, it will not come back on. It is good when someone can cycle power but it may be a problem if no one is around. I have not measured the time but I think one issue is that boot times appear to be longer, so sometimes I thought it did not boot but it was still booting. This is where the debug port really helps. |
Hi @CyberLeader3000 and thanks for the feedback, which brings interesting details About 1st failure mode: 2024-01-25-cm4-pcie-asm1166-stuck-after-2nd-stage.txt It was on 2024/01/25, at the end of the 2nd stage boot loader the green LED was stuck on and the UART log stopped at the exact same place. About 2nd failure mode: About the probability of those failures: For the ASM1064 on CM4 seeming to work fine, as you pointed out, unless it is used for a very long period (for something important enough to quickly notice when it's offline), it's hard to ensure it's always going to boot fine. When a device is risking to freeze itself during each of its reboots (after updates or even voltages drops on the electrical grid for example) most of the time when you realize it's offline you're screwed / not at side of it 😅 Anyway nice to confirm too that both for me and you, the Pi 5 isn't affected by this issue 🥳 so it's only about CM4 |
Hi @julienrobin28, Thanks again for confirming things. I think failure mode 1 happens less often and is easy to recover, so I am less worried about it. I think the next message should be "Booting Linux" so the transition from the boot code to Linux fails. The second failure mode I think is more complex and timing-related. The failure happens much more often when drives are connected. Without drives connected, it does not seem to happen. Looking at the log message, it is interesting how some of the message order and timings change. I think this might happen because of variations like the power supply ramp, PLL lock, and drives (if attached). I don't know much about Linux internals but I know the ARM cores a bit. It looks like a System Error (SError) is getting caught by the EL1 (Exception Level 1 - Supervisor mode) interrupt handler. This is normal and what it should do. System errors are normally caused by something that can not be traced back to a source easily, things like a writeback from a cache to inaccessible memory. It looks like the original cause might be an EL0 (Exception Level 0 - User mode) interrupt in the AHCI driver that was not handled. This is interesting but I have no idea how to investigate a potential issue with the AHCI driver. I leave the rootfs on the SD so it is easy for me to swap OS versions. I think this is just a boot issue, but I am going to run my NAS for a longer time and see how it works. Thanks! |
Describe the bug
Hi,
I switched from Raspberry Pi OS version 11 to version 12 on my Compute Module 4 IO Board, and noticed the SATA PCIe adapter I'm using now causes an almost consistent crash while booting the OS, when PCIe starts running.
The issue is encountered with pcieport driver, on 6.1.0-rpi4-rpi-v8 kernel (6.1.54)
After a lot of retries, I had it booting once with the SATA card present without panicking, so I took the "dmesg" output; here are the interesting lines:
However when kernel panic occurs, rootfs doesn't seem to be ready so nothing is logged into it (so here is an old fashioned picture of my screen attached).
When the SATA card is missing, no error occurs.
I tried putting the 6.1.21 kernel (and associated modules, overlays, dtb) from Raspberry Pi OS 11 back onto my new Raspberry Pi OS 12 installation, and everything worked as before, confirming that the issue is only about the kernel.
Having the "initramfs" loaded or not has no effect.
Comparing the kernel config from previous kernel and new kernel shows that the new kernel config is now enabling PCIe related additional features, at least the following options:
When rebuilding the new 6.1.54 kernel with the old 6.1.21 kernel config, it works fine (no "pcieport" driver issue as it isn't enabled in the kernel config)
However, I'm not sure why this "pcieport" issue occurs, and currently, I don't have any other PCIe card to be tried on it, unfortunately.
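The kernel config comparison described above can also be done with a short script instead of a graphical diff tool. This is a minimal sketch: the function names are hypothetical, and the two sample configs are made-up illustrations rather than the actual contents of the 6.1.21 and 6.1.54 configs. `CONFIG_PCIEPORTBUS` is the option discussed in this thread; `CONFIG_PCIEASPM` is included only as an example of a second PCIe option.

```python
def parse_config(text: str) -> dict[str, str]:
    """Map option names to values; '# CONFIG_X is not set' is recorded as 'n'."""
    suffix = " is not set"
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
    return options


def diff_configs(old_text: str, new_text: str, prefix: str = "CONFIG_PCI") -> dict[str, tuple[str, str]]:
    """Return {option: (old, new)} for options starting with prefix whose value differs."""
    old_opts, new_opts = parse_config(old_text), parse_config(new_text)
    changed = {}
    for name in sorted(set(old_opts) | set(new_opts)):
        if not name.startswith(prefix):
            continue
        before = old_opts.get(name, "(absent)")
        after = new_opts.get(name, "(absent)")
        if before != after:
            changed[name] = (before, after)
    return changed


# Hypothetical excerpts, for illustration only.
old_config = "CONFIG_PCI=y\n# CONFIG_PCIEPORTBUS is not set\n"
new_config = "CONFIG_PCI=y\nCONFIG_PCIEPORTBUS=y\nCONFIG_PCIEASPM=y\n"
for option, (before, after) in diff_configs(old_config, new_config).items():
    print(f"{option}: {before} -> {after}")
```

To compare real configs, read the two files (for example the `.config` generated for each kernel build) and pass their contents to `diff_configs`.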
Steps to reproduce the behaviour
Boot the new Raspberry Pi OS 12 on a Compute Module 4 IO Board, with the following PCIe card:
SATA controller [0106]: ASMedia Technology Inc. ASM1166 Serial ATA Controller [1b21:1166] (rev 02)
(I don't know if the issue also occurs with other PCIe cards).
Device (s)
Raspberry Pi CM4
System
Raspberry Pi reference 2023-10-10
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 962bf483c8f326405794827cce8c0313fd5880a8, stage2
Aug 10 2023 15:33:38
Copyright (c) 2012 Broadcom
version 03dc77429335caee083e22ddc8eec09c07f12a7a (clean) (release) (start)
Linux crobe-server-coudray 6.1.0-rpi4-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux
Logs
PCI-Express-BUG-6.1.54-dmesg.txt
PCI-Express-BUG-6.1.54-lspci-nn-vvv.txt
PCI-Express-OK-6.1.21-lspci-nn-vvv.txt
Additional context
EDIT from 2023-10-18 in the evening:
By looking at the command-line parameters available for Linux kernel 6.1, I found a workaround that avoids the kernel panics without changing the kernel.
After adding pcie_aspm=off to /boot/firmware/cmdline.txt, I no longer get kernel panics. However, dmesg messages about PCIe Bus Error and AER are still shown unless pcie_ports=compat is added too.
Adding pcie_ports=compat alone, however, does not avoid the kernel panics (it only removes the dmesg messages about PCIe Bus Error and AER).
Hoping this report may help,
Best regards