dm-crypt corruption issues (?) #200
Comments
Similar issue here: #196 Should we roll back the kernel? If you have a relatively easy and safe way to replicate, can you try a few past kernel versions? |
I'm currently trying ZFS with its own crypto layer, so if it's really dm-crypt (only) I shouldn't be affected anymore. If that's stable, and everything is set up again, I can do the smoketest with the external drive on various kernel versions and see if there's a pattern. |
Might be coincidental, but I hit some bad ext4 corruption yesterday on my M1, also using dm-crypt. It was rebuilding the kernel+mesa, and the compile started failing with gcc complaining one of the kernel .c files was filled with binary content. I noticed this in dmesg:
Rebooted to this, and ran a repair. I don't have the full fsck log, but it was large.
|
I've had some recent corruption issues as well rendering my primary partition unbootable; I don't have a log to provide, but I'm using an ext4 partition on LUKS. I'm not sure what it could be exactly. |
I tried reproducing the issue from there, by copying my I could not immediately reproduce it anymore, though that's a kernel with many more options enabled, essentially a distro kernel built with the asahi kernel sources (https://github.com/yu-re-ka/nixos-m1/tree/minimize-patches). |
I could reproduce this using ext4 + LUKS and btrfs + LUKS -- I didn't try for long, but it seemed like btrfs without LUKS was not exhibiting this issue (as observed by multiple scrubs without checksum errors). Wonder if this also happens on Fedora -- spent a bunch of time trying to find any mention of it but no luck -- I had managed to convince myself this was a hardware issue on my side until now :) |
I can reproduce it with a vanilla Linux v6.8.9 on an M1 Pro MacBook (j316). That opens up the possibility to bisect it. The reproducer I am using is |
Can you post a bit more details on how to reproduce? I don't know tio and a quick search didn't turn up anything helpful in particular. |
Sorry, that was because I typoed the name. The tool is called fio: axboe/fio. It worked for me both with In the meantime my bisect also pointed me to 2632e2521769 ("arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD") as the commit responsible. |
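For readers without fio at hand, the general idea of a write-then-verify smoke test can be sketched in Python. This is not the reproducer used in this thread (the exact fio invocation is not preserved above), and it is not known whether it would trigger this particular bug. The function names, block size, and file path are hypothetical and chosen only for illustration.

```python
import hashlib
import os


def write_blocks(path, n_blocks, block_size=1 << 20):
    """Write n_blocks of random data and return the SHA-256 digest of each block."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            block = os.urandom(block_size)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        # Push the data out of Python's buffer and ask the OS to write it to the device.
        f.flush()
        os.fsync(f.fileno())
    return digests


def verify_blocks(path, digests, block_size=1 << 20):
    """Re-read the file and return the indices of blocks whose digest no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(i)
    return bad
```

A read that follows the write immediately is likely served from the page cache, so a meaningful run would unmount and remount the filesystem between write_blocks() and verify_blocks(), much like the copy-then-unmount step in the original report.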
I double checked with the proper asahi kernel. It is fixed for me with the following commits reverted:
|
Do you want me to forward this report upstream? If yes, two short questions: Did you do the bisection with a vanilla kernel? Is vanilla 6.9 still showing the same problem? And does a revert help there, too (I assume all of that is the case, but sometimes it's better to be sure) [side note: I'm the Linux kernel's regression tracker; somebody pointed me here; normally I do not comment on downstream bug trackers, but I make an exception due to the data corruption aspect] |
ahh, I see, somebody reported it upstream already: https://lore.kernel.org/all/D1B7GPIR9K1E.5JFV37G0YTIF@shadowice.org/ great, thx! |
That was me reporting it, but thanks for the offer. |
Thanks all for the debugging efforts. I plan to do a NixOS Apple Silicon release with a revert patch within 24-48 hours, assuming the Asahi Linux kernel branch is not updated. |
Hej @mixi, |
You are right to be confused. Reverting 2632e2521769 alone is enough, and that is also the commit bisect pointed me to yesterday. Apparently I reverted one commit too many by accident and guessed I did it for context reasons when writing the comment afterwards. |
@tpwrules |
Latest release contains the revert. @flokli please close the issue if you are satisfied with that fix. |
Bad news: aefbab8e77eb ("arm64: fpsimd: Preserve/restore kernel mode NEON at context switch") also needs to be reverted. See https://lore.kernel.org/all/Zkw9kK0sXIgfqd01@shadowice/ for details, and a new reproducer that found the commit (the old one reproducibly sees the commit as good).
Correction: Apparently I reverted the right commit for the wrong reasons back then. |
Please try this fix, and report on the thread whether or not it works for you: |
Just to make sure, is this a fix to be applied on top of any reverts (and if so, which), or an attempt to fix without reverting anything else? |
The latter. |
With I guess what's left here is bumping |
PR up at #202 |
#202 has been merged (bumping the kernel to On the upstream kernel side, I however noticed the fix only landed in the master branch so far - meaning other aarch64 machines running the mainline kernel might still run into this corruption. @knurd is there anything else left to be done so this gets cherrypicked to linux-6.9.y, so it'll land in |
That's likely too late, as 6.9.2 is in its -rc phase already – and usually Greg does not add any patches at that point aiui. You could ask though. But it likely should go into 6.9.3 due to the "CC: stable..." tag in the commit. |
(trying here, as I don't have that ML subscribed): Hey @gregkh, any chance "arm64/fpsimd: Avoid erroneous elide of user state reload" could still end up in 6.9.2, due to its data corruption nature? |
Please send stable requests to stable@vger.kernel.org, we can't take stuff from random github repos for obvious reasons. |
The commit I linked had a Cc: stable in the message. That's sufficient? |
Up to Greg, but I'd say it's in everyone's best interest if you write a quick mail to the list (like with most Linux kernel lists, you don't have to be subscribed!) with Greg CCed (side note: you might ask for the patch to be included in 6.8.y, too) – that among others is also important for the paper trail in case the question "who asked for this to be included" comes up later. |
@knurd Sent out an email to stable@, both you and greg are in CC. |
In the last few days I've been running into a bunch of btrfs corruption issues on my Macbook M2 Air. I initially suspected a single fluke, but it got worse.
Yesterday I entirely re-created the filesystem (LUKS with --allow-discards), then mkfs.btrfs with default params, and again got btrfs errors.
It seems I can rule out the internal SSD, as the same issue also happens on a (somewhat reliable and fast) external SSD (formatted with LUKS and btrfs).
This was after copying my /nix/store from the host to /mnt, and unmounting.
dmesg of the host: