-
-
Notifications
You must be signed in to change notification settings - Fork 815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710
Comments
Thank you very much for the detailed report and analysis! I have a couple of questions and remarks:
I tried to reproduce your results by running: SEARCH_ROOT="$HOME/some-large-folder"
benchmark() {
libc="$1"
hyperfine \
--parameter-scan threads 1 8 \
--warmup 5 \
--export-json "fd-threads-$libc.json" \
--export-markdown "fd-threads-$libc.md" \
"./fd-v8.2.1-$libc -j {threads} nonexisting $SEARCH_ROOT"
}
benchmark gnu
benchmark musl
hyperfine/scripts/plot_parametrized.py \
fd-threads-gnu.json \
fd-threads-musl.json \
--titles gnu,musl And I see the following results (I have a 4-core system with hyperthreading): Some remarks:
Possibly related: #616 ? |
I'm not sure what you'd consider statistically relevant here. They are reproducible and the difference between runs is small. I tried hyperfine below. With musl:
With glibc:
I'm always working with a hot cache.
It gives me 450 results; the search from the tests above have no results, so the issue is unlikely to be locking when writing to stdout (assuming that's what you were thinking about).
I couldn't get
The jemalloc (on musl) performance was quite similar to glibc, so I think the outlier here is musl.
It definitely seems reproducible here (see tests above)... Do you know what version of musl you're using? Given the general theme here, that just looks like the present malloc issue can get even worse; so if we can fix that, I'm reasonably confident the issue would resolve itself. |
That's what I wanted to know 😄 👍
Ok, so that's ruled out. Yes - that's what I was thinking about.
My mistake. I accidentally looked at the "usr time" results for the
I actually don't :-o. I don't have musl installed. It seems to come with the Rust toolchain? How did you even manage to build a dynamically linked musl version of A
How would we go about this? I can certainly imagine that the recursive directory walker is quite heavy on allocations ( We should probably find some way to count/measure all kinds of allocations ( A quick fix would be to enable jemalloc for musl builds. For some reason, I deactivated them originally (#481), but we could try to re-activate this: Lines 35 to 44 in 6a3aae5
|
Ah, that'd make sense. I have no idea what Rust is currently shipping, then; they might be stuck with older musl temporarily, since musl 1.2.x changed
Not sure that's possible; the allocation happens for the
I remember reading reports about jemalloc + musl being broken for Rust (or even in general), so this might not work at all and/or be unstable in some way. I don't know the current state. |
It's not plausible that allocations by Moreover, |
I'm not sure on the allocation, but it does seem to use a function that will usually require locking: https://github.com/rust-lang/rust/blob/ddafcc0b66e04ac4793361d3bd604c8c8a9d16a8/library/std/src/sys/unix/fs.rs#L493 I will see if I can get a backtrace on the lock, since it might not be due to malloc after all... Or maybe this (from right above it)
does allocations? |
I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next |
although I am not overly familiar with Rust, the |
That raises another issue, that Rust's API seems to impose an assumption that the platform has a |
this is only on non-solaris/fuchsia/redox/illumos. on those platforms, the name is copied into a heap-allocated buffer: |
I don't think so.
Exactly 23848 directories are being scanned, yes. No additional allocations seem to happen. Note that the heaptrack run was performed on the GNU libc version (as I don't have a dynamically linked musl version of
That would be great. But as you mentioned, your |
I had forgotten to grab proper debug info. With it available, I can better show what's going on. The flame graphs above are, respectively, the 1, 4, 8 and 12 thread runs. From a 12 thread run, the main culprits for calling
Using
|
|
Could you please go into a bit more detail? |
I'm assuming it was caused by #863, the 8.3.0 release seems to have improved performance on musl. For a search with many results, where musl was taking about ~2x the time glibc was, musl is now taking approximately ~1.25x the time. Unfortunately, when there aren't result matches, which means no printing, musl is still taking ~3x the time that glibc is. |
@tavianator FYI |
Well, I can reproduce this:
There is a lot of allocation contention as this bug report suggests. I don't think this can be easily worked around in Well, I can think of one workaround: diff --git a/Cargo.toml b/Cargo.toml
index 203b98e..d8864f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -64,7 +64,7 @@ libc = "0.2"
# FIXME: Re-enable jemalloc on macOS
# jemalloc is currently disabled on macOS due to a bug in jemalloc in combination with macOS
# Catalina. See https://github.com/sharkdp/fd/issues/498 for details.
-[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd"), not(target_env = "musl")))'.dependencies]
+[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd")))'.dependencies]
jemallocator = "0.3.0"
[dev-dependencies]
diff --git a/src/main.rs b/src/main.rs
index 4f4ab14..371a1ce 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -39,7 +39,6 @@ use crate::regex_helper::{pattern_has_uppercase_char, pattern_matches_strings_wi
not(target_os = "android"),
not(target_os = "macos"),
not(target_os = "freebsd"),
- not(target_env = "musl")
))]
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc; Much better:
|
I don't know what to add at this point except that, if |
I might be mistaken, but it is my understanding that all Rust APIs that end up making a syscall, because they receive a OsStr, will end up allocating in order to generate a CStr to pass to the libc functions. Idk if removing that occurrence of allocations would help these results much. |
@richfelker That's probably true but likely impractical for this project, as the paths and whatever else are dynamically allocated by other libraries that we don't control. If I'm not mistaken, the current implementation of mallocng takes an exclusive lock even for reads. Is it easy to convert it to a real rwlock? That might help a bit. |
The design facilitates rwlock but was found not to be of any benefit, because the critical section is tiny and the cost is the memory-synchronizing operation itself, not exclusion due to the lock. As far as I can tell this is fundamental to global consistency, not any deficiency of implementation. It's vaguely possible that a more specialized rwlock (vs the general purpose pthread primitive) could do slightly better, but based on everything I've seen so far, I doubt it. |
Traditional thread caching is incompatible with the consistency checking in hardened_malloc and musl's malloc-ng. They could use batched allocation for malloc by having a thread-local array for the smallest size classes with a small number (4-16) of allocations to amortize the cost of locking for malloc but they can't put allocations back into it on free like glibc and jemalloc. That means there's inherently going to be the cost of locking/unlocking for each call to malloc (at least without batching) and free. Even an uncontended lock is expensive. hardened_malloc may add optional batched allocation support. It's worth noting that despite being a hardened allocator focused on security properties, hardened_malloc can be used in a lightweight configuration with comparable performance to glibc malloc and jemalloc with their thread caches disabled. If you disable all the optional security features other than slab canaries, it still provides better security properties than those in every way. There's an inherently high cost to most of the other optional features that are enabled by default (zero on free, write after free check, frequent guard slabs and especially the slab quarantines) and the optional slot randomization currently has a fair bit of overhead but we expect that we can optimize it to be quite cheap when built specifically for modern x86_64 CPUs via the PDEP feature. hardened_malloc supports jemalloc-style arenas so it can scale well to multiple cores. It has round-robin assignment of threads to arenas similar to jemalloc. The non-Android configuration uses 4 arenas by default but we use 1 arena on Android by default. Enabling more arenas makes it scale better to more threads but the security-focused approach in hardened_malloc needs to reserve a massive amount of address space for each arena. The typical 47-bit userspace address space (48 total) on x86_64 isn't enough for a lot more than 4 arenas. Linux on mobile arm64 typically has only a 38-bit userspace address space (39 total) due to using 3-level page tables as an optimization, but it can be configured to have 47-bit like x86_64. The arena approach fits nicely into hardened_malloc but it doesn't really play well with the optional slab quarantine feature. Arenas multiply the worst case cached / quarantined memory of the allocator since they're separate instances with their own cached/quarantined memory. The quarantine feature purposely holds onto freed allocations and can't easily purge any memory for them. Multiplying that for each arena makes it much worse. We may end up making the quarantines much smaller by default and not quarantining the largest slab allocation sizes by default but until then it's a strong pressure against using many arenas. We use only 1 arena on GrapheneOS since we use both of the slab quarantine features. Making the slab quarantine features lighter is an open problem but there's unlikely to be any substantial improvement beyond choosing a different security vs. memory usage balance for the default configuration which is more slanted towards lighter memory usage. OpenBSD malloc also now supports a similar arena approach, although it doesn't require lots of address space since they don't provide strict isolated regions and use a hash table to find the metadata for a slab or large allocation rather than addresses. OpenBSD malloc is far slower than hardened_malloc though. The whole point of hardened_malloc was making a replacement for OpenBSD malloc on 64-bit with better performance and also security via isolated regions without address space reused between them and making the optional security features fast enough to have them all enabled by default. musl could use arenas via inline metadata specifying the arena (which would end up being verified through the other checks) but multiplying the cached memory goes against the lightweight memory approach. It could be an option though and wouldn't be that complicated. It would probably only be around 50 lines of code. I don't really see that happening since musl is focused on being lightweight on memory usage and simplicity. Perhaps it's worth trying it to see how much benefit it would offer though. It's also not quite as bad as multiplying the worst case cached memory in practice as long as threads are statically assigned to arenas. It depends on the workload. |
This was the initial arena implementation for hardened_malloc (with w=1 as a URL parameter to hide whitespace-only changes): GrapheneOS/hardened_malloc@c5e9114?w=1 It was subsequently switched to using round-robin static assignment of threads to arenas instead of the placeholder random static assignment of threads to arenas and also micro-optimized. It hasn't changed significantly from the initial implementation in terms of approach. The difference for malloc-ng would be needing to find the arena from inline metadata instead of address range. It's really fairly trivial to implement. I just don't knows how well it fits with what @richfelker is trying to for musl's malloc-ng since it raises memory usage. It would be pretty lightweight if it was only 2-4 arenas though. In jemalloc, they make 4 arenas for every logical thread which makes it quite heavyweight, and in hardened_malloc it's slab quarantines that make it heavyweight when those are enabled, since they're heavyweight themselves. |
Also worth noting: hardened_malloc has entirely separate allocators for each small (slab) allocation size class within each arena with separate locks. It has global locking for large allocations and then a lock for each per-size-class-per-arena allocator. There are no other locks. It has no shared state since it doesn't share anything across size classes for security reasons as part of that isolated size class design. I don't think that per-size-class locking reduces contention that much though. We wouldn't have gone out of the way to do it. It makes sense because the ONLY cost for us to provide it is needing a ChaCha8 CSPRNG state struct for each size class allocator rather than one per arena. It doesn't actually increase memory usage due to how metadata is split up anyway. It's not like it matters in a typical security-focused configuration using slab quarantines since those are so heavy... |
Fixes sharkdp#710. Fixes sharkdp#980.
Fixes sharkdp#710. Fixes sharkdp#980.
Partial fix for sharkdp#710 and sharkdp#980.
Could someone please benchmark fd built with mimalloc instead of jemalloc? |
Should be easy to do using https://docs.rs/mimalloc/latest/mimalloc/. We have a benchmark suite here: https://github.com/sharkdp/fd-benchmarks (see regression.sh) |
EDIT: Ah, happy new year!
Describe the bug you encountered:
When comparing the speed of
fd
with musl vs glibc, they are very comparable with-j1
, but when the core count goes above a certain threshold, musl gets diminishing returns, to the point that with-j12
, on a 6-core system (with hyperthreading), it's actually slower than with-j4
.With fd + musl:
With fd + musl + pre-loaded jemalloc:
With fd + glibc:
With
perf
, I can track the issue (apparently) down to__lock
:This matches with what was observed: with more threads, you get more lock contention, and can end up spending way too much time locked. I'd expect a similar effect to be observable with
hardened_malloc
from GrapheneOS and possibly OpenBSD's malloc. Theoretically speaking, rust might not require a hardened malloc, but you usually don't want to carry your own malloc for each application either, so it'd be nice if this could perform better "out of the box", by simply performing less allocations (which helps by removing most of the dependency on malloc performance). That said, I don't know how feasible fixing this is.Describe what you expected to happen:
I'd expected all versions of the test to show similar growth in execution speed. Note that this is in my
~/
directory, because otherwise execution is short enough that you can't spot the issue (the numbers with a tarball of the linux kernel were too close).What version of
fd
are you using?fd 8.2.1
Which operating system / distribution are you on?
Linux (kernel 5.10.2), with musl 1.2.1
glibc used for testing was 2.30
jemalloc used for
LD_PRELOAD
was 5.2.1The text was updated successfully, but these errors were encountered: