Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fd is allocation heavy, which can lead to heavy lock contention on some malloc implementations, especially hardened ones #710

Open
ericonr opened this issue Dec 31, 2020 · 27 comments

Comments

@ericonr
Copy link

ericonr commented Dec 31, 2020

EDIT: Ah, happy new year!

Describe the bug you encountered:

When comparing the speed of fd with musl vs glibc, they are very comparable with -j1, but when the core count goes above a certain threshold, musl gets diminishing returns, to the point that with -j12, on a 6-core system (with hyperthreading), it's actually slower than with -j4.

With fd + musl:

~ ➜ time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    3.17 secs   fish           external
   usr time    1.81 secs    0.00 micros    1.81 secs
   sys time    1.36 secs  599.00 micros    1.36 secs

~ took 3s ➜ time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.11 secs   fish           external
   usr time    5.40 secs  621.00 micros    5.40 secs
   sys time    2.47 secs   46.00 micros    2.47 secs

~ took 2s ➜ time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.62 secs   fish           external
   usr time   11.58 secs  476.00 micros   11.58 secs
   sys time    7.39 secs   35.00 micros    7.39 secs

~ took 2s ➜ time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.98 secs   fish           external
   usr time   16.22 secs    0.00 micros   16.22 secs
   sys time   15.78 secs  559.00 micros   15.78 secs

With fd + musl + pre-loaded jemalloc:

~ took 2s ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2.75 secs   fish           external
   usr time    1.43 secs    0.00 micros    1.43 secs
   sys time    1.32 secs  913.00 micros    1.32 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  790.87 millis    fish           external
   usr time    1.75 secs  930.00 micros    1.75 secs
   sys time    1.37 secs    0.00 micros    1.37 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  552.14 millis    fish           external
   usr time    2.40 secs    0.00 micros    2.40 secs
   sys time    1.89 secs  882.00 micros    1.89 secs

~ ➜ time LD_PRELOAD=/usr/lib/libjemalloc.so.2 fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  539.76 millis    fish           external
   usr time    3.42 secs  828.00 micros    3.42 secs
   sys time    2.76 secs    0.00 micros    2.76 secs

With fd + glibc:

[chroot] /home/ericonr > time fd -j1 freebsd ~/ >/dev/null

________________________________________________________
Executed in    2,73 secs   fish           external
   usr time    1,28 secs  488,00 micros    1,28 secs
   sys time    1,35 secs  141,00 micros    1,35 secs

[chroot] /home/ericonr > time fd -j4 freebsd ~/ >/dev/null

________________________________________________________
Executed in  854,58 millis    fish           external
   usr time    1,75 secs    0,00 micros    1,75 secs
   sys time    1,61 secs  585,00 micros    1,61 secs

[chroot] /home/ericonr > time fd -j8 freebsd ~/ >/dev/null

________________________________________________________
Executed in  601,37 millis    fish           external
   usr time    2,25 secs  439,00 micros    2,25 secs
   sys time    2,35 secs  144,00 micros    2,35 secs

[chroot] /home/ericonr > time fd -j12 freebsd ~/ >/dev/null

________________________________________________________
Executed in  558,01 millis    fish           external
   usr time    3,10 secs  612,00 micros    3,10 secs
   sys time    3,15 secs    0,00 micros    3,15 secs

With perf, I can track the issue (apparently) down to __lock:

  Children      Self  Command  Shared Object     Symbol
+   29.83%    29.58%  fd       libc.so           [.] __lock
+   27.29%     0.00%  fd       libc.so           [.] a_cas (inlined)

This matches with what was observed: with more threads, you get more lock contention, and can end up spending way too much time locked. I'd expect a similar effect to be observable with hardened_malloc from GrapheneOS and possibly OpenBSD's malloc. Theoretically speaking, rust might not require a hardened malloc, but you usually don't want to carry your own malloc for each application either, so it'd be nice if this could perform better "out of the box", by simply performing less allocations (which helps by removing most of the dependency on malloc performance). That said, I don't know how feasible fixing this is.

Describe what you expected to happen:

I'd expected all versions of the test to show similar growth in execution speed. Note that this is in my ~/ directory, because otherwise execution is short enough that you can't spot the issue (the numbers with a tarball of the linux kernel were too close).

What version of fd are you using?

fd 8.2.1

Which operating system / distribution are you on?

Linux (kernel 5.10.2), with musl 1.2.1

glibc used for testing was 2.30

jemalloc used for LD_PRELOAD was 5.2.1

@ericonr ericonr added the bug label Dec 31, 2020
@sharkdp
Copy link
Owner

sharkdp commented Jan 2, 2021

Thank you very much for the detailed report and analysis!

I have a couple of questions and remarks:

  • Are these benchmark results reliable? In the sense of: statistically significant?
  • Did you account for I/O caching effects (i.e.: either reset disk caches between runs or perform warmup runs)?
  • How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)
  • How can we be sure that this is related to lock contention on malloc? Couldn't it be another shared resource that is the bottleneck? Like .. the file system?

I tried to reproduce your results by running:

SEARCH_ROOT="$HOME/some-large-folder"

benchmark() {
    libc="$1"
    hyperfine \
        --parameter-scan threads 1 8 \
        --warmup 5 \
        --export-json "fd-threads-$libc.json" \
        --export-markdown "fd-threads-$libc.md" \
        "./fd-v8.2.1-$libc -j {threads} nonexisting $SEARCH_ROOT"
}

benchmark gnu
benchmark musl

hyperfine/scripts/plot_parametrized.py \
  fd-threads-gnu.json \
  fd-threads-musl.json \
  --titles gnu,musl

And I see the following results (I have a 4-core system with hyperthreading):
image

Some remarks:

  • I tried to run another set of benchmarks with a LD_PRELOADed jemalloc for the musl version, but I see no difference in timing to the plain musl version. It seems like LD_PRELOAD=/usr/lib/libjemalloc.so.2 doesn't work for me (the library does exist). Do I need a special version of libjemalloc for that? My mistake. I was using a statically linked version of fd-musl 🙈
  • The difference between GNU libc and musl is huge. From this experience I would have expected a slow-down of 30%, but not a factor of 2-3 😦.
  • I can't seem to reproduce your results. There doesn't seem to be a (significant) slowdown for j=$(nproc)=8 threads.

Possibly related: #616 ?

@ericonr
Copy link
Author

ericonr commented Jan 3, 2021

Are these benchmark results reliable? In the sense of: statistically significant?

I'm not sure what you'd consider statistically relevant here. They are reproducible and the difference between runs is small. I tried hyperfine below.

With musl:

~ took 22s ➜ hyperfine -i 'fd -j1 akjdakljd'
Benchmark #1: fd -j1 akjdakljd
  Time (mean ± σ):      3.540 s ±  0.045 s    [User: 1.978 s, System: 1.538 s]
  Range (min … max):    3.494 s …  3.621 s    10 runs

~ took 35s ➜ hyperfine -i 'fd -j4 akjdakljd'
Benchmark #1: fd -j4 akjdakljd
  Time (mean ± σ):      2.241 s ±  0.070 s    [User: 5.783 s, System: 2.644 s]
  Range (min … max):    2.125 s …  2.341 s    10 runs

~ took 22s ➜ hyperfine -i 'fd -j8 akjdakljd'
Benchmark #1: fd -j8 akjdakljd
  Time (mean ± σ):      3.285 s ±  0.125 s    [User: 14.704 s, System: 9.194 s]
  Range (min … max):    3.058 s …  3.459 s    10 runs

~ took 32s ➜ hyperfine -i 'fd -j12 akjdakljd'
Benchmark #1: fd -j12 akjdakljd
  Time (mean ± σ):      3.493 s ±  0.208 s    [User: 19.171 s, System: 18.874 s]
  Range (min … max):    3.130 s …  3.705 s    10 runs

With glibc:

[chroot] /home/ericonr > hyperfine -i 'fd -j1 akjdakljd'
Benchmark #1: fd -j1 akjdakljd
  Time (mean ± σ):      2.724 s ±  0.023 s    [User: 1.326 s, System: 1.380 s]
  Range (min … max):    2.699 s …  2.762 s    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j4 akjdakljd'
Benchmark #1: fd -j4 akjdakljd
  Time (mean ± σ):     914.6 ms ±  33.4 ms    [User: 1.823 s, System: 1.781 s]
  Range (min … max):   886.5 ms … 979.2 ms    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j8 akjdakljd'
Benchmark #1: fd -j8 akjdakljd
  Time (mean ± σ):     725.3 ms ±  12.7 ms    [User: 2.865 s, System: 2.755 s]
  Range (min … max):   705.4 ms … 741.8 ms    10 runs

[chroot] /home/ericonr > hyperfine -i 'fd -j12 akjdakljd'
Benchmark #1: fd -j12 akjdakljd
  Time (mean ± σ):     718.1 ms ±  18.9 ms    [User: 3.969 s, System: 3.936 s]
  Range (min … max):   675.7 ms … 738.4 ms    10 runs

Did you account for I/O caching effects (i.e.: either reset disk caches between runs or perform warmup runs)?

I'm always working with a hot cache.

How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)

It gives me 450 results; the search from the tests above have no results, so the issue is unlikely to be locking when writing to stdout (assuming that's what you were thinking about).

How can we be sure that this is related to lock contention on malloc? Couldn't it be another shared resource that is the bottleneck? Like .. the file system?

I couldn't get perf to show me a backtrace with __lock being called by malloc, so I can't say with absolute certainty that it was the culprit. I can say, however, that I have good evidence for it, since:

  • on the same machine, kernel and files, glibc doesn't have the issue
  • the issue vanished when I pre-loaded a different malloc implementation

The difference between GNU libc and musl is huge. From this experience I would have expected a slow-down of 30%, but not a factor of 2-3 frowning.

The jemalloc (on musl) performance was quite similar to glibc, so I think the outlier here is musl.

I can't seem to reproduce your results. There doesn't seem to be a (significant) slowdown for j=$(nproc)=8 threads.

It definitely seems reproducible here (see tests above)... Do you know what version of musl you're using? Given the general theme here, that just looks like the present malloc issue can get even worse; so if we can fix that, I'm reasonably confident the issue would resolve itself.

@sharkdp
Copy link
Owner

sharkdp commented Jan 3, 2021

I'm not sure what you'd consider statistically relevant here. They are reproducible and the difference between runs is small

That's what I wanted to know 😄 👍

How many results does the fd freebsd search yield in your home directory? (how many results need to be written to /dev/null?)

It gives me 450 results; the search from the tests above have no results, so the issue is unlikely to be locking when writing to stdout (assuming that's what you were thinking about).

Ok, so that's ruled out. Yes - that's what I was thinking about.

I couldn't get perf to show me a backtrace with __lock being called by malloc, so I can't say with absolute certainty that it was the culprit. I can say, however, that I have good evidence for it, since:

* on the same machine, kernel and files, glibc doesn't have the issue

* the issue vanished when I pre-loaded a different malloc implementation

My mistake. I accidentally looked at the "usr time" results for the LD_PRELOAD benchmark and didn't see the huge difference.

Do you know what version of musl you're using?

I actually don't :-o. I don't have musl installed. It seems to come with the Rust toolchain? How did you even manage to build a dynamically linked musl version of fd? (rust-lang/rust#34987)

A perf profile looks quite different for me:

  Children      Self  Command  Shared Object     Symbol
+   39,68%    39,68%  fd       [unknown]         [k] 0xffffffff8c000163                                            ◆
+   18,30%     0,14%  fd       fd                [.] fstatat                                                       ▒
+   17,75%     0,46%  fd       fd                [.] core::ops::function::FnOnce::call_once{{vtable-shim}}         ▒
+   17,73%     3,91%  fd       fd                [.] ignore::walk::Worker::run_one                                 ▒
+   17,64%     0,00%  fd       fd                [.] start                                                         ▒
+   17,64%     0,00%  fd       fd                [.] std::sys::unix::thread::Thread::new::thread_start             ▒
+   17,64%     0,00%  fd       fd                [.] <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call▒
+   17,64%     0,00%  fd       fd                [.] core::ops::function::FnOnce::call_once{{vtable-shim}}         ▒
+   17,64%     0,00%  fd       fd                [.] std::sys_common::backtrace::__rust_begin_short_backtrace      ▒
+   16,47%    16,37%  fd       fd                [.] __bin_chunk                                                   ▒
+   11,42%     0,21%  fd       fd                [.] readdir                                                       ▒
+    9,28%     0,03%  fd       fd                [.] sccp                                                          ▒
+    6,35%     0,32%  fd       fd                [.] ignore::gitignore::Gitignore::matched_stripped                ▒
+    6,28%     6,28%  fd       fd                [.] malloc                                                        ▒

Given the general theme here, that just looks like the present malloc issue can get even worse; so if we can fix that, I'm reasonably confident the issue would resolve itself.

How would we go about this? I can certainly imagine that the recursive directory walker is quite heavy on allocations (readdir calls). I was actually thinking about this in the past. It would be a nice experiment to try out a custom allocator that is optimized for the directory traversal (i.e. for struct dirent allocations?).

We should probably find some way to count/measure all kinds of allocations (heaptrack?).

A quick fix would be to enable jemalloc for musl builds. For some reason, I deactivated them originally (#481), but we could try to re-activate this:

fd/src/main.rs

Lines 35 to 44 in 6a3aae5

// We use jemalloc for performance reasons, see https://github.com/sharkdp/fd/pull/481
// FIXME: re-enable jemalloc on macOS, see comment in Cargo.toml file for more infos
#[cfg(all(
not(windows),
not(target_os = "android"),
not(target_os = "macos"),
not(target_env = "musl")
))]
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

@sharkdp
Copy link
Owner

sharkdp commented Jan 3, 2021

Indeed: most of the allocations are caused by opendir:

image

@ericonr
Copy link
Author

ericonr commented Jan 3, 2021

I actually don't :-o. I don't have musl installed. It seems to come with the Rust toolchain? How did you even manage to build a dynamically linked musl version of fd? (rust-lang/rust#34987)

Ah, that'd make sense. I have no idea what Rust is currently shipping, then; they might be stuck with older musl temporarily, since musl 1.2.x changed time_t to a 64-bit integer for all platforms (including 32-bit ones). Distros like Alpine and Void Linux patch the system rustc to use dynamic linking on musl targets (Alpine defines an alpine target, Void just changes the default for the normal musl target).

It would be a nice experiment to try out a custom allocator that is optimized for the directory traversal (i.e. for struct dirent allocations?).

Not sure that's possible; the allocation happens for the DIR struct, which is completely opaque. The struct in turn contains a buffer that stores the struct dirent read and returns them with readdir.

A quick fix would be to enable jemalloc for musl builds. For some reason, I deactivated them originally (#481)

I remember reading reports about jemalloc + musl being broken for Rust (or even in general), so this might not work at all and/or be unstable in some way. I don't know the current state.

@richfelker
Copy link

It's not plausible that allocations by opendir are a performance bottleneck. Each opendir is an open syscall which is multiple orders of magnitude more costly than malloc. If a large performance of overall time is being spent in malloc contention, the calls are from elsewhere.

Moreover, readdir does not allocate anything; the object returned is part of the open DIR object and its lifetime (from the perspective of the caller, per the spec) ends the next time readdir is called (or closedir) on the same DIR object. But perhaps Rust's POSIX glue shims readdir with an allocation for memory-safety reasons..?

@ericonr
Copy link
Author

ericonr commented Jan 3, 2021

I'm not sure on the allocation, but it does seem to use a function that will usually require locking: https://github.com/rust-lang/rust/blob/ddafcc0b66e04ac4793361d3bd604c8c8a9d16a8/library/std/src/sys/unix/fs.rs#L493

I will see if I can get a backtrace on the lock, since it might not be due to malloc after all...

Or maybe this (from right above it)

            let mut ret = DirEntry { entry: mem::zeroed(), dir: Arc::clone(&self.inner) };
            let mut entry_ptr = ptr::null_mut();

does allocations?

@richfelker
Copy link

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

@Hello71
Copy link

Hello71 commented Jan 3, 2021

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

although I am not overly familiar with Rust, the impl Iterator for ReadDir does not appear to require allocation. all of the allocations are on the stack or RVO. furthermore, the posted screenshot claims that the allocations are made in __alloc_dir in libc, not from any Rust code. actually, this screenshot does not appear to support the claim that excessive allocations are slowing down the program: 23848 allocations is not a substantial amount, assuming that 23848 directories are being scanned.

@richfelker
Copy link

That raises another issue, that Rust's API seems to impose an assumption that the platform has a NAME_MAX...

@Hello71
Copy link

Hello71 commented Jan 4, 2021

this is only on non-solaris/fuchsia/redox/illumos. on those platforms, the name is copied into a heap-allocated buffer:

https://github.com/rust-lang/rust/blob/ddafcc0b66e04ac4793361d3bd604c8c8a9d16a8/library/std/src/sys/unix/fs.rs#L467

@sharkdp
Copy link
Owner

sharkdp commented Jan 4, 2021

Or maybe this (from right above it)

let mut ret = DirEntry { entry: mem::zeroed(), dir: Arc::clone(&self.inner) };
let mut entry_ptr = ptr::null_mut();

does allocations?

I don't actually know Rust, but I think the quoted code necessitates allocation.

I don't think so. let mut ret = DirEntry { … } creates a new DirEntry object on the stack. The dir field is a Arc<InnerReadDir>, i.e. a shared pointer to InnerReadDir. Calling Arc::clone(&self.inner) clones the shared pointer (and increases the refcount), but does not perform an allocation.

It's not plausible that allocations by opendir are a performance bottleneck. Each opendir is an open syscall which is multiple orders of magnitude more costly than malloc. If a large performance of overall time is being spent in malloc contention, the calls are from elsewhere.

actually, this screenshot does not appear to support the claim that excessive allocations are slowing down the program: 23848 allocations is not a substantial amount, assuming that 23848 directories are being scanned.

Exactly 23848 directories are being scanned, yes. No additional allocations seem to happen.

Note that the heaptrack run was performed on the GNU libc version (as I don't have a dynamically linked musl version of fd). From the stacktrace, I can see that __alloc_dir it is called from __opendir_tail. From looking at the source, this is called from opendir and opendirat.

I don't actually know Rust, but I think the quoted code necessitates allocation. It should probably be reusing internal storage or if there's an API contract to keep the object valid after the next ReadDir call on the same stream, allocating arrays of them in advance and returning a member of the array.

I will see if I can get a backtrace on the lock, since it might not be due to malloc after all...

That would be great. But as you mentioned, your LD_PRELOAD experiment with jemalloc seems to indicate that this is caused by malloc.

@ericonr
Copy link
Author

ericonr commented Jan 4, 2021

ut-j1
out-j4
out-j8
out-j12

I had forgotten to grab proper debug info. With it available, I can better show what's going on. The flame graphs above are, respectively, the 1, 4, 8 and 12 thread runs.

From a 12 thread run, the main culprits for calling __lock seem to be:

Samples: 139K of event 'cycles:u', Event count (approx.): 40416534270
  Children      Self  Command  Shared Object     Symbol
-   46.41%    45.97%  fd       libc.so           [.] __lock                                   ◆
   - 14.00% 0xffffffffffffffff                                                                ▒
      - 6.46% std::path::PathBuf::_push                                                       ▒
         - realloc                                                                            ▒
              5.84% __lock                                                                    ▒
            - 0.62% malloc                                                                    ▒
                 __lock                                                                       ▒
      - 3.11% regex::dfa::Fsm::cached_state                                                   ▒
           2.82% __lock                                                                       ▒
      - 1.43% ignore::gitignore::GitignoreBuilder::build                                      ▒
           1.30% __lock                                                                       ▒
      - 1.13% <alloc::vec::Vec<T> as core::clone::Clone>::clone                               ▒
           1.03% __lock                                                                       ▒
      - 1.06% <thread_local::ThreadLocal<T> as core::default::Default>::default               ▒
           0.96% __lock                                                                       ▒
   - 3.07% 0x7f905a2d35ef                                                                     ▒
      - std::ffi::os_str::OsStr::to_os_string                                                 ▒
           2.73% __lock                                                                       ▒
     1.60% 0x8                                                                                ▒
   - 1.39% 0x682e                                                                             ▒
      - std::ffi::os_str::OsStr::to_os_string                                                 ▒
           1.23% __lock                                                                       ▒
   - 0.85% 0xf423f                                                                            ▒
        0x7f905af91d3f                                                                        ▒
   - 0.66% 0x94939291908f8e8c                                                                 ▒
      - globset::glob::char_to_escaped_literal                                                ▒
           0.59% __lock                                                                       ▒
   - 0.58% 0x7                                                                                ▒
        0.52% std::ffi::os_str::OsStr::to_os_string                                           ▒
     0.53% 0x7f905afdabbf                                                                   

Using perf report --call-graph=callee, I get:

  Children      Self  Command  Shared Object     Symbol
-   46.41%    45.97%  fd       libc.so           [.] __lock                                   ◆
   - __lock                                                                                   ▒
      - 9.61% std::ffi::os_str::OsStr::to_os_string                                           ▒
           2.73% 0x7f905a2d35ef                                                               ▒
           1.23% 0x682e                                                                       ▒
      - 9.35% realloc                                                                         ▒
         - 8.91% std::path::PathBuf::_push                                                    ▒
              5.84% 0xffffffffffffffff                                                        ▒
      - 4.37% malloc                                                                          ▒
           1.28% std::ffi::os_str::OsStr::to_os_string                                        ▒
         - 0.92% realloc                                                                      ▒
            - 0.88% std::path::PathBuf::_push                                                 ▒
                 0.68% 0xffffffffffffffff                                                     ▒
      - 3.86% <thread_local::ThreadLocal<T> as core::default::Default>::default               ▒
           0.96% 0xffffffffffffffff                                                           ▒
      - 3.02% regex::dfa::Fsm::cached_state                                                   ▒
           2.82% 0xffffffffffffffff                                                           ▒
      - 2.73% calloc                                                                          ▒
           2.31% globset::GlobSet::matches_candidate_into                                     ▒
      - 2.07% globset::glob::char_to_escaped_literal                                          ▒
           0.59% 0x94939291908f8e8c                                                           ▒
        1.89% <&[u8] as std::ffi::c_str::CString::new::SpecIntoVec>::into_vec                 ▒
      - 1.41% <alloc::vec::Vec<T> as core::clone::Clone>::clone                               ▒
           1.03% 0xffffffffffffffff                                                           ▒
      - 1.30% ignore::gitignore::GitignoreBuilder::build                                      ▒
           1.30% 0xffffffffffffffff                                                           ▒
      - 1.18% ignore::walk::Worker::run_one                                                   ▒
           0.81% 0x7f905af91d3f                                                               ▒
        0.77% free                                                                            ▒
        0.62% regex::re_unicode::escape                                                       ▒
+   20.38%     0.00%  fd       [unknown]         [.] 0xffffffffffffffff                       ▒
+   19.22%     3.77%  fd       fd                [.] ignore::walk::Worker::run_one            ▒
-   17.98%     1.04%  fd       libc.so           [.] realloc                                  ▒
   - realloc                                                                                  ▒
      - 17.05% std::path::PathBuf::_push                                                      ▒
           8.19% 0xffffffffffffffff                                                           ▒
           0.76% 0xc                                                                          ▒
+   17.94%     0.34%  fd       fd                [.] std::path::PathBuf::_push                ▒
-   16.58%     0.26%  fd       fd                [.] core::ops::function::FnOnce::call_once{{v▒
   - core::ops::function::FnOnce::call_once{{vtable-shim}}                                    ▒
      - 16.33% std::sys_common::backtrace::__rust_begin_short_backtrace                       ▒
           core::ops::function::FnOnce::call_once{{vtable-shim}}                              ▒
           std::sys::unix::thread::Thread::new::thread_start                                  ▒
           start                                                                              ▒
           __clone                                                                           

@sharkdp sharkdp added performance and removed bug labels Feb 15, 2021
@ghost
Copy link

ghost commented Mar 5, 2021

strace -c fd -uuu . / > /dev/null on my system reports the futex syscall using 99% of (system) time.
Would it be any use to have a single producer multi consumer queue sending back the PathBuf to each thread for reuse?

@sharkdp
Copy link
Owner

sharkdp commented Mar 14, 2021

Would it be any use to have a single producer multi consumer queue sending back the PathBuf to each thread for reuse?

Could you please go into a bit more detail?

@ericonr
Copy link
Author

ericonr commented Dec 3, 2021

I'm assuming it was caused by #863, the 8.3.0 release seems to have improved performance on musl. For a search with many results, where musl was taking about ~2x the time glibc was, musl is now taking approximately ~1.25x the time.

Unfortunately, when there aren't result matches, which means no printing, musl is still taking ~3x the time that glibc is.

@sharkdp
Copy link
Owner

sharkdp commented Dec 3, 2021

@tavianator FYI

@tavianator
Copy link
Collaborator

Well, I can reproduce this:

Benchmark 1: ./fd-glibc -j24 . ~/code/linux
  Time (mean ± σ):     127.2 ms ±   2.8 ms    [User: 809.6 ms, System: 1858.4 ms]
  Range (min … max):   120.2 ms … 133.0 ms    23 runs
 
Benchmark 2: ./fd-musl -j24 . ~/code/linux
  Time (mean ± σ):      1.047 s ±  0.020 s    [User: 5.097 s, System: 14.196 s]
  Range (min … max):    1.011 s …  1.063 s    10 runs
 
Summary
  './fd-glibc -j24 . ~/code/linux' ran
    8.23 ± 0.24 times faster than './fd-musl -j24 . ~/code/linux'

There is a lot of allocation contention as this bug report suggests. I don't think this can be easily worked around in fd, the musl malloc() implementation just doesn't really scale.

Well, I can think of one workaround:

diff --git a/Cargo.toml b/Cargo.toml
index 203b98e..d8864f0 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -64,7 +64,7 @@ libc = "0.2"
 # FIXME: Re-enable jemalloc on macOS
 # jemalloc is currently disabled on macOS due to a bug in jemalloc in combination with macOS
 # Catalina. See https://github.com/sharkdp/fd/issues/498 for details.
-[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd"), not(target_env = "musl")))'.dependencies]
+[target.'cfg(all(not(windows), not(target_os = "android"), not(target_os = "macos"), not(target_os = "freebsd")))'.dependencies]
 jemallocator = "0.3.0"
 
 [dev-dependencies]
diff --git a/src/main.rs b/src/main.rs
index 4f4ab14..371a1ce 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -39,7 +39,6 @@ use crate::regex_helper::{pattern_has_uppercase_char, pattern_matches_strings_wi
     not(target_os = "android"),
     not(target_os = "macos"),
     not(target_os = "freebsd"),
-    not(target_env = "musl")
 ))]
 #[global_allocator]
 static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

Much better:

Benchmark 1: ./fd-glibc -j24 . ~/code/linux
  Time (mean ± σ):     126.5 ms ±   3.4 ms    [User: 808.9 ms, System: 1836.1 ms]
  Range (min … max):   120.9 ms … 131.6 ms    22 runs
 
Benchmark 2: ./fd-musl -j24 . ~/code/linux
  Time (mean ± σ):     135.7 ms ±   2.5 ms    [User: 840.1 ms, System: 2140.8 ms]
  Range (min … max):   129.8 ms … 141.8 ms    21 runs
 
Summary
  './fd-glibc -j24 . ~/code/linux' ran
    1.07 ± 0.03 times faster than './fd-musl -j24 . ~/code/linux

@richfelker
Copy link

I don't know what to add at this point except that, if malloc is such a bottleneck here, you'd get much better performance on glibc too by eliminating the gratuitous malloc/free cycles...

@ericonr
Copy link
Author

ericonr commented Dec 3, 2021

I might be mistaken, but it is my understanding that all Rust APIs that end up making a syscall, because they receive a OsStr, will end up allocating in order to generate a CStr to pass to the libc functions. Idk if removing that occurrence of allocations would help these results much.

@tavianator
Copy link
Collaborator

@richfelker That's probably true but likely impractical for this project, as the paths and whatever else are dynamically allocated by other libraries that we don't control.

If I'm not mistaken, the current implementation of mallocng takes an exclusive lock even for reads. Is it easy to convert it to a real rwlock? That might help a bit.

@richfelker
Copy link

richfelker commented Dec 4, 2021

The design facilitates rwlock but was found not to be of any benefit, because the critical section is tiny and the cost is the memory-synchronizing operation itself, not exclusion due to the lock. As far as I can tell this is fundamental to global consistency, not any deficiency of implementation. It's vaguely possible that a more specialized rwlock (vs the general purpose pthread primitive) could do slightly better, but based on everything I've seen so far, I doubt it.

@thestinger
Copy link

Traditional thread caching is incompatible with the consistency checking in hardened_malloc and musl's malloc-ng. They could use batched allocation for malloc by having a thread-local array for the smallest size classes with a small number (4-16) of allocations to amortize the cost of locking for malloc but they can't put allocations back into it on free like glibc and jemalloc. That means there's inherently going to be the cost of locking/unlocking for each call to malloc (at least without batching) and free. Even an uncontended lock is expensive. hardened_malloc may add optional batched allocation support.

It's worth noting that despite being a hardened allocator focused on security properties, hardened_malloc can be used in a lightweight configuration with comparable performance to glibc malloc and jemalloc with their thread caches disabled. If you disable all the optional security features other than slab canaries, it still provides better security properties than those in every way. There's an inherently high cost to most of the other optional features that are enabled by default (zero on free, write after free check, frequent guard slabs and especially the slab quarantines) and the optional slot randomization currently has a fair bit of overhead but we expect that we can optimize it to be quite cheap when built specifically for modern x86_64 CPUs via the PDEP feature.

hardened_malloc supports jemalloc-style arenas so it can scale well to multiple cores. It has round-robin assignment of threads to arenas similar to jemalloc. The non-Android configuration uses 4 arenas by default but we use 1 arena on Android by default. Enabling more arenas makes it scale better to more threads but the security-focused approach in hardened_malloc needs to reserve a massive amount of address space for each arena. The typical 47-bit userspace address space (48 total) on x86_64 isn't enough for a lot more than 4 arenas. Linux on mobile arm64 typically has only a 38-bit userspace address space (39 total) due to using 3-level page tables as an optimization, but it can be configured to have 47-bit like x86_64.

The arena approach fits nicely into hardened_malloc but it doesn't really play well with the optional slab quarantine feature. Arenas multiply the worst case cached / quarantined memory of the allocator since they're separate instances with their own cached/quarantined memory. The quarantine feature purposely holds onto freed allocations and can't easily purge any memory for them. Multiplying that for each arena makes it much worse. We may end up making the quarantines much smaller by default and not quarantining the largest slab allocation sizes by default but until then it's a strong pressure against using many arenas. We use only 1 arena on GrapheneOS since we use both of the slab quarantine features. Making the slab quarantine features lighter is an open problem but there's unlikely to be any substantial improvement beyond choosing a different security vs. memory usage balance for the default configuration which is more slanted towards lighter memory usage.

OpenBSD malloc also now supports a similar arena approach, although it doesn't require lots of address space since they don't provide strict isolated regions and use a hash table to find the metadata for a slab or large allocation rather than addresses. OpenBSD malloc is far slower than hardened_malloc though. The whole point of hardened_malloc was making a replacement for OpenBSD malloc on 64-bit with better performance and also security via isolated regions without address space reused between them and making the optional security features fast enough to have them all enabled by default.

musl could use arenas via inline metadata specifying the arena (which would end up being verified through the other checks) but multiplying the cached memory goes against the lightweight memory approach. It could be an option though and wouldn't be that complicated. It would probably only be around 50 lines of code. I don't really see that happening since musl is focused on being lightweight on memory usage and simplicity. Perhaps it's worth trying it to see how much benefit it would offer though. It's also not quite as bad as multiplying the worst case cached memory in practice as long as threads are statically assigned to arenas. It depends on the workload.

@thestinger
Copy link

This was the initial arena implementation for hardened_malloc (with w=1 as a URL parameter to hide whitespace-only changes):

GrapheneOS/hardened_malloc@c5e9114?w=1

It was subsequently switched to using round-robin static assignment of threads to arenas instead of the placeholder random static assignment of threads to arenas and also micro-optimized. It hasn't changed significantly from the initial implementation in terms of approach. The difference for malloc-ng would be needing to find the arena from inline metadata instead of address range. It's really fairly trivial to implement. I just don't knows how well it fits with what @richfelker is trying to for musl's malloc-ng since it raises memory usage. It would be pretty lightweight if it was only 2-4 arenas though. In jemalloc, they make 4 arenas for every logical thread which makes it quite heavyweight, and in hardened_malloc it's slab quarantines that make it heavyweight when those are enabled, since they're heavyweight themselves.

@thestinger
Copy link

Also worth noting: hardened_malloc has entirely separate allocators for each small (slab) allocation size class within each arena with separate locks. It has global locking for large allocations and then a lock for each per-size-class-per-arena allocator. There are no other locks. It has no shared state since it doesn't share anything across size classes for security reasons as part of that isolated size class design. I don't think that per-size-class locking reduces contention that much though. We wouldn't have gone out of the way to do it. It makes sense because the ONLY cost for us to provide it is needing a ChaCha8 CSPRNG state struct for each size class allocator rather than one per arena. It doesn't actually increase memory usage due to how metadata is split up anyway. It's not like it matters in a typical security-focused configuration using slab quarantines since those are so heavy...

tavianator added a commit to tavianator/fd that referenced this issue Jul 13, 2022
tavianator added a commit to tavianator/fd that referenced this issue Sep 16, 2022
sharkdp pushed a commit that referenced this issue Sep 19, 2022
@jirutka
Copy link

jirutka commented Nov 19, 2022

Could someone please benchmark fd built with mimalloc instead of jemalloc?

@sharkdp
Copy link
Owner

sharkdp commented Nov 21, 2022

Should be easy to do using https://docs.rs/mimalloc/latest/mimalloc/. We have a benchmark suite here: https://github.com/sharkdp/fd-benchmarks (see regression.sh)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants