Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upData loss on rotated journal files on BTRFS volumes using compression #9112
Comments
|
Well, but what's there to do in journald really? File system bugs suck of course, but this is really something to fix in the kernel, no? Userspace is not there for working around kernel bugs, in particular not bugs that threaten data integrity like this. Or to say this differently: this really should be dealt with as high priority kernel bug and be puhed out quickly to distributions that way. I mean, even if journald would change behaviour here, the stuff that is already rotated away is not going to be magically fixed by that either is it? Really, this appears to me as something that should be dealt with outside of systemd, unless I am misunderstanding something here? |
|
Even with all the kernel and btrfs-progs fixes, BTRFS design is that once a file has been written to, it can't have NOCOW turned back off. I'm not sure if I agree with this, but it's documented. I'm sure there's more reasons for it, but doing otherwise would mean triggering a potentially hours long operation on extremely large (TB's) files. So, even with the fixes, journald is still trying to use BTRFS in a way it says it can't (turning NOCOW back off.) I understand having NOCOW on while the journal file is being heavily written to. But, it's a good idea to turn NOCOW off once a journal file is done changing, so it can be checksummed and received BTRFS' automatic correction and manual scrub protections. If nothing is done to journald, future journal files will forever remain unprotected. IMO, the best solution is when rotating a file on BTRFS for it to be
The stuff that is already rotated away won't be fixed by this, only journal files going forward. Existing journal files. IMO, it would be a good idea for journald to give already rotated files this same treatment, since journald has always intended for them to have COW back on, and did so in an unsupported way that never worked. (Except they'd need to be re-written to a temporary file and moved back over the original filename.) On volumes not using compression, rotated journald files should all still have the NOCOW attribute. Those aren't vulnerable to the compression and btrfs replace bugs, but they're still vulnerable to being easily corrupted without the checksum protection. |
|
Well, the file attributes have been introduced for btrfs, and we make use of them to deal with the btrfs case, but they are in naming and concept generic, and could be defined on any COW file system, not just btrfs. I mean, if turning off NOCOW again on btrfs is not supported, then it should just fail, which is entirely fine, as we ignore the errors on that, for precisely that reason. I mean, semantically, while we write to journal files we hint that NOCOW behaviour is preferable to our write pattern, after archival it's not. I think it's generally a good idea to let the kernel know about this, and we do, and it's up to the kernel to make use of that or not. Toggling a bit in a graceful way (i.e. ignoring errors) is simple, minimal and requires very little code. It's also generic, as we can just call the ioctl and it will do the right thing on COW file systems (if possible), and it will be a NOP on others, without us having to explicitly check for btrfs. Adding btrfs-specific code that copies the files on archival I think is very nasty OTOH as it would be relatively complex, synchronous, require quite some code and would be very specific to btrfs as we'd have to check for btrfs file systems first. Hence, I am pretty sure data corruption in btrfs is something that should be dealt with in btrfs, in the kernel, and journald is not the place to work around this, or add more btrfs-specific code really. |
Is NOCOW default in systemd for all COW filesystems, or just Btrfs? It can have unintended complications. For example compression and nocow are not compatible It there an easy way to prevent journald from setting nocow attributes during boot?
Of course. Errors in the file system should be handled there not matter of COW or NOCOW. |
Two locations need to be changed, because of several bugs causing a data loss risk on rotated journal files.
In short, if the journal files are on a BTRFS volume that uses compression, there has been a BTRFS bug that upon the rotation leaves the files NOCOW (this part is by design) but compresses the file anyway, and doesn't give it checksums. If a single bit goes bad, it will fail decompression and journald will be unable to read any of it. If
btrfs replaceis used to replace an existing drive with a new one, it is guaranteed to corrupt the journal files because the new drive will have an uncompressed version that is marked as being compressed, and the decompression routines will obviously fail. The locations are:src/journal/journal-file.c::journal_file_close()::378src/journal/journal-file.c::journal_file_open_reliably()::3555These need to either:
Remove the
chattr_*()andbtrfs_defrag*()calls, at least until (if) it is determined how the compression is happening and fixed or avoided.Be re-written so the file being rotated is not renamed, but instead a new file is created with the new name leaving COW on, and the contents copied over. (I don't think it would then need defragmentation since it was just written.) This of course affects timestamps unless modified to match, makes it a new inode, and breaks snapshot consistency.
(They could look if the BTRFS volume is mounted without compression, and if the parent directories are not causing compression to be inherited, and make no behavior change if both of these are true.)
In long:
systemd-journald originally had very poor performance when its journal files were on a BTRFS filesystem, due to its copy-on-write (COW) behavior. Starting in 219, new journal files were marked with the inode flag NOCOW, which disabled this behavior. Starting in 220, this was done in a way allowing the user to modify
tmpfiles.d/journal-nocow.confto change this behavior.On BTRFS, an existing inode that is marked nodatacow (NOCOW) can only have this flag removed if the file is empty. BTRFS simply does not allow this flag to be removed if the file has contents. A workaround is to copy the existing file to a new one without the nodatacow attribute, and move the new file over the original. But, unless this workaround is done manually, BTRFS silently ignores requests to remove the nodatacow attribute on non-empty files.
journal_file_close()::378attempts turning off FS_NOCOW_FL. If I understand the code correctly, when closing a journal file marked defrag on close, usechattr_fd()to turn FS_NOCOW_FL off (so turns COW back on), and submit it for defragmentation. The only time a journal file is marked defrag on close is atjournal_file_rotate()::3484, for the old file being rotated out. When rotating out, it's done by renaming the file at::3466, not involving copying it to a new file.journal_file_open_reliably()::3555also attempts turning off FS_NOCOW_FL. If I understand the code correctly, while opening a journal file through this function, if the journal format is corrupted, rotate it (rename it), usechattr_path()to turn FS_NOCOW_FL off (so turns COW back on), submit it for defragmentation, and retry opening it once.The calls to
chattr_*()ultimately callioctl(fd, FS_IOC_SETFLAGSand appropriately handle FS_NOCOW_FL, and the defrag is done by ultimately callingioctl(fd, BTRFS_IOC_DEFRAG, NULL).On BTRFS, turning on NOCOW (as acknowledged in systemd NEWS for 219) also disables BTRFS checksums. BTRFS also disables compression for NOCOW files. BTRFS intends for these NOCOW files, as long as anything has been written in them, to never be able to have NOCOW turned off, to never be able to have checksums, and to never be able to be compressed. There was a bug in BTRFS, that in certain circumstances, such as a NOCOW file being submitted for defragmentation while asking for compression, that a file would remain NOCOW and not have checksums, but be compressed anyway. This BTRFS bug has been patched, but hasn't been submitted to Linus yet.
There is another bug in BTRFS, that is being worked on, that running
btrfs replace(which replaces a disk in a BTRFS volume with another disk) will corrupt NOCOW files that are compressed, and because they don't have checksums, scrub and check will not catch the error. (A temporary workaround is to instead perform a btrfs device add then a remove.) The bug is that on the disk being added to the BTRFS volume, data will be written in its uncompressed form but still be marked as compressed. (It hasn't been determined what happens to the data between the size of the expected compressed data and the actual uncompressed data.) If the data is not mirrored, this will result in at least an I/O error, and at most, a kernel crash with internal kernel memory corruption, because BTRFS is submitting decompressed data for decompression. (The BTRFS causing kernel crash has been patched, but hasn't been submitted to Linus yet.) If the data is mirrored, whether the data is read correctly or has the I/O error or kernel crash depends on if BTRFS reads any of the file off the new device, or if it gets extremely lucky and reads all of the file off other devices.So, this all means that if a user has their journal files on a BTRFS volume mounted with compression, rotated journal files are still NOCOW, don't have checksums, but are compressed, which was never intended to be allowed by BTRFS. If the user then performs a device replace, they are guaranteed to corrupt any journal files being moved from the disk being removed to the one being added.
The peculiar thing is that I mentioned that BTRFS compresses a NOCOW file when it's submitted for defragmentation while asking for compression. This can be done manually by running
btrfs fi defrag -c. But, systemd does not do this, even programmatically. Because it usesBTRFS_IOC_DEFRAGrather thanBTRFS_IOC_DEFRAG_RANGE, it cannot even ask for compression. There's an unknown bug somewhere.Simply removing the
chattr_*()calls is unexpected by me to work. Unless the file is empty (but still being rotated, which is unlikely), the call is guaranteed not to do anything. But, this doesn't seem to be triggering compressing the file, as I extractedchattr_path()and the related systemd code verbatim, and cannot replicate in my own program causing the compression.A simple way to see if the bad behavior is happening is to run lsattr on the journal files. If the rotated ones have the "C" (NOCOW) attribute, the
chattr_*()calls aren't doing anything. Then, runfilefrag -von a rotated file. If the flags column includes "encoded", that means it's compressed. The combination of these two things should never happen. (Notefilefragdoesn't properly show the proper ending offsets or length for BTRFS compressed files.) A better way is of course to usebtrfs inspect-internal dump-tree, with more knowledge about how BTRFS works.To see linux-btrfs discussion of the above, search for subjects since 5/14/2018 including:
'"decompress failed" in 1-2 files always causes kernel oops', 'Don't compress if NODATASUM', 'Enhance btrfs handling compression', 'Avoid decompressing obviously corrupted data', 'Harden decompression callers to avoid', 'Add header length check to avoid slab out of bounds access', 'btrfs device replace can cause silent or noisy corruption on compressed NOCOW/NODATASUM', and 'Harden inline lzo compressed extent decompression'.
There will be more on linux-btrfs as this is still being worked on.