Log files moderate fragmentation #22940

Closed

birdie-github opened this issue Apr 1, 2022 · 20 comments

@birdie-github

I'm using systemd-249.9-1.fc35.x86_64 in Fedora 35 with pretty much everything by default.

I'm quite appalled by how horribly fragmented systemd log files are.

e4defrag 1.46.3 (27-Jul-2021)
ext4 defragmentation for ./log/journal/x//system@y-0000000000011a30-z.journal
[1/1]./log/journal/x/system@y-0000000000011a30-z.journal:	100%  extents: 43

The other log files' extent counts are 39, 28, etc. All files are exactly 8388608 bytes (8 MiB).
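
(For reference, the extent count e4defrag prints can also be read directly with the FIEMAP ioctl. A minimal standalone sketch, not part of systemd or e2fsprogs:)

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct fiemap fm;
    memset(&fm, 0, sizeof(fm));
    fm.fm_length = ~0ULL;   /* map the whole file */
    /* with fm_extent_count == 0 the kernel only fills in fm_mapped_extents */
    if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0) { perror("FS_IOC_FIEMAP"); return 1; }

    printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
    close(fd);
    return 0;
}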

I was under the impression that systemd is capable of preallocating file space using fallocate(2); why doesn't it do that?

Please do it by default. This shouldn't be happening.

@birdie-github birdie-github changed the title from "Huge log files fragmentation" to "Log files huge fragmentation" on Apr 1, 2022
@poettering
Member

I was under the impression that systemd is capable of preallocating file space using fallocate(2); why doesn't it do that?

We do use fallocate. Whenever we append something we allocate more space via fallocate, in 8M steps.
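
Roughly this pattern, as a minimal sketch of the idea (plain mode-0 fallocate; not the actual journald code):

#define _GNU_SOURCE
#include <fcntl.h>

#define STEP (8ULL * 1024 * 1024)   /* the 8M allocation step */

/* Ensure at least `needed` bytes are allocated, rounded up to the step. */
static int journal_grow(int fd, unsigned long long needed) {
    unsigned long long goal = (needed + STEP - 1) / STEP * STEP;
    /* mode 0: plain preallocation; extends the file if necessary */
    return fallocate(fd, 0, 0, (off_t) goal);
}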

@birdie-github
Author

"Thanks" for immediately closing the issue despite the evidence that this is not working as intended.

@birdie-github
Author

All nine files that I have are in "8M steps" and all of them are fragmented as hell.

@birdie-github
Author

Not even trying to confirm or deny the bug report, just closing it.

Amazing attitude.

@mrc0mmand
Member

Reopening this to give it a closer look (and maybe /cc @DaanDeMeyer).

@mrc0mmand mrc0mmand reopened this Apr 1, 2022
@vcaputo
Member

vcaputo commented Apr 1, 2022

With the hole-punching-of-archives change, it's not exactly unexpected for archived journals to be fragmented.
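
(The hole punching referred to here is the FALLOC_FL_PUNCH_HOLE mode of fallocate(2). A minimal sketch of what such a pass does per unused region; in the real code the offsets would presumably come from the journal's own bookkeeping:)

#define _GNU_SOURCE
#include <fcntl.h>

/* Deallocate an all-zero region of an archived journal. KEEP_SIZE is
 * mandatory with PUNCH_HOLE, so the file size stays the same; later
 * reads of the hole return zeroes without touching the backing store. */
static int punch_unused(int fd, off_t offset, off_t len) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
}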

@birdie-github
Author

birdie-github commented Apr 2, 2022

I'm using ext4 with a ton of free space.

Mount options: defaults,noatime,discard

Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype
                          extent 64bit flex_bg sparse_super large_file
                          dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl

Filesystem       Size  Used Avail Use% Mounted on
/dev/root         20G  6.2G   13G  33% /

      114807 inodes used (8.76%, out of 1310720)
          40 non-contiguous files (0.0%)
         129 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 98609/29
     1687058 blocks used (32.18%, out of 5242880)
           0 bad blocks
           1 large file

@DaanDeMeyer
Contributor

Fragmentation is indeed expected, because we grow the file in 8MB increments using ftruncate() and punch holes in it when archiving. When BTRFS is used, we rewrite the entire file so we can enable COW; the side effect is that we get rid of fragmentation as well. We could rewrite the file unconditionally to always get rid of fragmentation, but that would double our write rate, since we'd effectively write every journal file twice.

Without gathering some actual data to compare the tradeoffs, I have no idea whether this would be a good idea or not.

(Of course, when rewriting there are a few extra things we could do, like coalescing entry arrays, that might make rewriting worth it.)

@poettering
Member

Fragmentation is expected for files written piecemeal. I see no problem with that. 40 fragments isn't terrible; 40000 would be terrible. Quite frankly, I doubt it's worth the fuss. We do what we can to minimize fragments, and the results are not terrible, so unless people can show it's worth generating additional IO to remove the fragments on archival, I am not sure we should really bother.

btrfs with COW is a different story, since writing to the middle of files causes heavy fragmentation there, way beyond what is seen on ext4. That's why we write journal files with COW disabled, and rewrite them to re-enable it on archival.
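
(For illustration, on btrfs the COW toggle is the FS_NOCOW_FL inode flag; a minimal sketch, assuming it is flipped while the file has no data blocks yet, which is when btrfs honors it:)

#include <sys/ioctl.h>
#include <linux/fs.h>

/* Set or clear the btrfs NOCOW inode flag. btrfs only honors the flag
 * on files without data blocks, hence the rewrite dance described
 * above when re-enabling COW on archival. */
static int set_nocow(int fd, int on) {
    int flags;
    if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0)
        return -1;
    if (on)
        flags |= FS_NOCOW_FL;
    else
        flags &= ~FS_NOCOW_FL;
    return ioctl(fd, FS_IOC_SETFLAGS, &flags);
}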

Anyway, I'd just close this. Files with complex write patterns cause fragmentation; there's no news in that.

@vcaputo
Member

vcaputo commented Apr 5, 2022

That's fair, but in this particular issue it's 8MiB journals, and we allocate them in 8MiB chunks, so that's not really relevant.

My assumption is this affects only the archived journals, due to the hole punching. WRT @DaanDeMeyer's suggestion of rewriting archives to defragment them, it really feels like the kernel should just give us an ioctl to say "hey, could you optimize this file when you get a chance, since I'm done writing to it, kthx". That way the underlying filesystem could do clever things like making in-place reorganization and/or compression succeed in low-space scenarios. It feels like we're working around shortcomings by doing it in userspace, and depriving the filesystems of the opportunity to do it better.

@DaanDeMeyer
Contributor

My assumption is this affects only the archived journals, due to the hole punching. WRT @DaanDeMeyer's suggestion of rewriting archives to defragment them, it really feels like the kernel should just give us an ioctl to say "hey, could you optimize this file when you get a chance, since I'm done writing to it, kthx". That way the underlying filesystem could do clever things like making in-place reorganization and/or compression succeed in low-space scenarios. It feels like we're working around shortcomings by doing it in userspace, and depriving the filesystems of the opportunity to do it better.

There's one of those for btrfs (BTRFS_IOC_DEFRAG) but I agree it'd be nice to have a generic ioctl that did it for us regardless of which FS is used.

@poettering
Member

poettering commented Apr 5, 2022

There's a defrag ioctl, and we actually used to issue it (but that was dropped in #21598, though I think mostly by accident). I'm not sure it's supported outside of btrfs, though.
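
(For reference, issuing it is essentially a one-liner; a minimal sketch, not necessarily how systemd used to call it:)

#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Ask btrfs to defragment the file behind fd; NULL selects default
 * options. On other filesystems the ioctl just fails (e.g. ENOTTY). */
static int defrag_file(int fd) {
    return ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
}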

Defragging is not an obvious choice, though. The IO you generate this way doesn't come for free, so the benefit of removing fragments has to heavily outweigh the benefit of keeping IO minimal.

I don't see that here. My educated guess is that 40 frags don't matter; 40000 would. Unless someone actually shows that generating a lot of defrag IO to maybe reduce 40 frags to somewhat fewer is worth it, I think we should close this.

@poettering
Member

There's one of those for btrfs (BTRFS_IOC_DEFRAG) but I agree it'd be nice to have a generic ioctl that did it for us regardless of which FS is used.

In the fs layer, ioctls usually start out in one fs and then get renamed when implemented in others.

@poettering poettering changed the title from "Log files huge fragmentation" to "Log files moderate fragmentation" on Apr 5, 2022
@vcaputo
Member

vcaputo commented Apr 5, 2022

8MiB archives are kind of an edge-case scenario; this GH issue arguably exists because of that, IMO. Nobody aware of journald's preallocation in 8MiB increments would expect to see 40 fragments backing them on an empty filesystem, so it's totally understandable that @birdie-github filed this under the impression something is misbehaving. Through the lens of their minimal size, one might extrapolate that you'd indeed see problematic numbers of fragments for a larger journal.

It's this very same edge case that drove me to uncover the disproportionate wasted space in preallocated tiny journals. Now, instead of a complaint that the space is wasted, we have a complaint that the file is fragmented, thanks to hole-punching. But I think this particular fragmentation is probably harmless.

Closing is fine with me. @birdie-github, does what you're seeing make more sense now?

@birdie-github
Author

Closing is fine with me. @birdie-github, does what you're seeing make more sense now?

Not really, or maybe I'm too stupid. What I've picked up is that instead of preallocating space you punch holes, and the file is bound to be fragmented because holes are not guaranteed to be contiguous.

It would be great if there were a config option for that. I don't use BTRFS and I don't really care about other CoW filesystems.

@vcaputo
Member

vcaputo commented Apr 8, 2022

Closing is fine with me. @birdie-github, does what you're seeing make more sense now?

Not really, or maybe I'm too stupid. What I've picked up is that instead of preallocating space you punch holes, and the file is bound to be fragmented because holes are not guaranteed to be contiguous.

It's not instead of preallocating. It's that when a journal gets archived, in the interest of reclaiming wasted space, a hole-punching pass is performed to deallocate substantial unused (zeroed) regions. It's basically sparsifying the archive. It's really not a big deal: accesses in holes are now fulfilled with generated zeroes without hitting the backing store at all. The fragments straddling the holes should be left in their prior layout from the contiguous preallocation; the holes have just been made available for reuse, and it's up to the filesystem to pack appropriately sized objects into those holes without ill effect for accessing those objects.
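
(A minimal sketch of how to observe that sparsification from userspace, via lseek(2)'s SEEK_DATA/SEEK_HOLE; independent of systemd:)

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>

/* Print the regions of a file that still have backing blocks; everything
 * between them is a punched (or never-allocated) hole that reads as zeroes. */
static void print_data_map(int fd) {
    off_t data = 0, hole;
    while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
        hole = lseek(fd, data, SEEK_HOLE);   /* implicit hole at EOF */
        if (hole < 0)
            break;
        printf("data: [%lld, %lld)\n", (long long) data, (long long) hole);
        data = hole;
    }
}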

Can you demonstrate an actual, measurable performance problem resulting from this? If not, it's mostly just cosmetic; please close the issue.

@birdie-github
Author

As far as I understand, this functionality, the "hole-punching pass", is only needed for CoW filesystems. If I'm correct, it would be nice to get an option to disable it altogether.

@vcaputo
Member

vcaputo commented Apr 9, 2022

As far as I understand, this functionality, the "hole-punching pass", is only needed for CoW filesystems. If I'm correct, it would be nice to get an option to disable it altogether.

That's not the case. We added hole-punching on archival to reclaim space that journald had prepared for use but then archived the file before writing anything other than zeroes into it. This doesn't only apply to CoW filesystems; what gives you that impression?

@poettering
Member

Anyway, closing for now. We can certainly reopen this if anybody can show this is a real performance bottleneck, and that defragging or similar could bring real benefits. But without numbers, 40 frags don't make me nervous... Hope it's OK if I hence close this.

@birdie-github
Author

Is it possible to disable this hole-punching/rotation/whatever? I see nothing in man journald.conf.

It'd be great if old files were simply deleted and new binary log files got created.
