excessive memory usage on huge trees #918

Closed
cehteh opened this issue Dec 29, 2021 · 17 comments
@cehteh

cehteh commented Dec 29, 2021

When traversing a huge directory tree (5 TB of data, 200 million entries) I noticed that 'fd' uses excessive amounts of memory (I killed it at 35 GB RSS). Some investigation showed that 'fd | wc -l' and 'fd >/dev/null' do not have this problem, while 'fd | less' shows the same problem again.

So my guess is that fd uses some unbounded queue to send its output, which piles up because the terminal emulator is too slow to print 200 million entries in time.
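As an illustration of that hypothesis, here is a minimal, self-contained Rust sketch (not fd's actual code) of how an unbounded std::sync::mpsc channel balloons when the producer outruns the consumer; memory grows with the backlog because send() never blocks:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // channel() is unbounded: send() never blocks, so the backlog
    // (and therefore memory) grows as fast as the producer can run.
    let (tx, rx) = mpsc::channel();

    // Fast producer, standing in for the directory traversal.
    thread::spawn(move || {
        for i in 0u64.. {
            tx.send(format!("entry-{i}")).unwrap();
        }
    });

    // Slow consumer, standing in for a slow terminal or pager.
    for entry in rx {
        println!("{entry}");
        thread::sleep(Duration::from_millis(1));
    }
}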

@tmccombs
Collaborator

Which version of fd are you using?

@cehteh
Author

cehteh commented Dec 30, 2021

$ fd --version
fd 8.3.0

installed by cargo install

@tmccombs
Collaborator

tmccombs commented Jan 4, 2022

Ah, I thought that we had set a limit on the channel size in #885, but it looks like that part got removed. So, yes, there is an unbounded queue of files and directories to process, and that could cause high memory usage if the terminal can't keep up. Maybe we should revisit adding a bound to that. I believe with std's mpsc channels there is a performance cost to doing that; I'm not sure about crossbeam.
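For comparison, a minimal sketch (not fd's code) of the bounded alternative: std's sync_channel(cap) makes send() block once cap messages are queued, which is exactly the backpressure that channel() lacks.

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    // At most 4096 messages can be queued; further send() calls block.
    let (tx, rx) = mpsc::sync_channel(4096);

    thread::spawn(move || {
        for i in 0u64..1_000_000 {
            // Blocks while the buffer is full, pausing the producer
            // until the consumer catches up; memory stays bounded.
            tx.send(format!("entry-{i}")).unwrap();
        }
    });

    for entry in rx {
        thread::sleep(Duration::from_micros(100)); // slow consumer
        drop(entry);
    }
}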

@sharkdp
Owner

sharkdp commented Jan 23, 2022

but it looks like that part got removed

Right. I removed it because you wrote "[if] I switch back to using channel instead of sync_channel then it is, at least, not any slower". And I believe I was able to confirm this in my own benchmarks.

Maybe we should revisit adding a bound to that. I believe with std's mpsc channels there is a performance cost to doing that; I'm not sure about crossbeam.

Yes, let's do this in a dedicated PR with new benchmark results (and memory profiles). A quick fix could be to use a much higher limit than the one suggested in #885.

@cehteh
Author

cehteh commented Feb 22, 2022

Hint: in my own directory traversal code I found that a bound of about 4k entries per thread (for the memory calculation, that is, not one channel per thread) is a somewhat sweet spot. Actually, the exact limit isn't terribly important as long as it doesn't explode; even a 100 MB limit should be OK-ish.
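A rough sketch of that heuristic (the sizing below is this comment's suggestion, not fd's actual code): derive the channel bound from the worker count.

use std::sync::mpsc;
use std::thread;

fn main() {
    // Heuristic from this comment: bound the queue at ~4k entries per thread.
    let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let capacity = 4096 * threads;
    let (tx, rx) = mpsc::sync_channel::<String>(capacity);

    let mut handles = Vec::new();
    for t in 0..threads {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for i in 0..10_000 {
                tx.send(format!("thread-{t}-entry-{i}")).unwrap();
            }
        }));
    }
    drop(tx); // close the channel once all producers finish

    let count = rx.iter().count();
    println!("received {count} results via a {capacity}-slot queue");
    for h in handles {
        h.join().unwrap();
    }
}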

@mwaitzman

Is this the cause of fd outputting "Killed" and then quitting when I search for all files with a specific extension from the root? (Arch Linux, ~9 TB of ~14 TB used space to search through, across 5 mounted partitions.)

@tavianator
Collaborator

@mwaitzman It might be. Do you see anything about the OOM killer in dmesg?

@mwaitzman

Yes, I see this when running a similar command:

[ 4279.148487] Out of memory: Killed process 7330 (fd) total-vm:61310736kB, anon-rss:24569660kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:111272kB oom_score_adj:0
[ 4285.522887] oom_reaper: reaped process 7330 (fd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

@tavianator
Collaborator

Okay yeah, I think this is probably due to the lack of backpressure with the unbounded channel. It's unfortunate that std::sync::mpsc::sync_channel() (the bounded queue) performs so much worse than channel() (unbounded).

The SPSC case (-j1) is 10-40% worse with bounded queues, and performance degrades further at lower capacities, so tuning the capacity is crucial. With a capacity of 4096 * threads, bounded queues match unbounded queues at -j8, but not with fewer threads.

I'd like to figure out #933.
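A crude harness for reproducing that kind of comparison (a sketch only; the numbers above came from benchmarking fd itself): time an SPSC stream of N messages through channel() versus sync_channel() at a few capacities.

use std::sync::mpsc;
use std::thread;
use std::time::Instant;

const N: u64 = 1_000_000;

fn time_it(name: &str, setup: impl FnOnce() -> mpsc::Receiver<u64>) {
    let start = Instant::now();
    let rx = setup();
    let sum: u64 = rx.iter().sum();
    println!("{name}: {:?} (sum {sum})", start.elapsed());
}

fn main() {
    // Unbounded: send() never blocks.
    time_it("channel()", || {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            for i in 0..N {
                tx.send(i).unwrap();
            }
        });
        rx
    });

    // Bounded: send() blocks once `cap` messages are queued.
    for cap in [64, 4096, 65536] {
        time_it(&format!("sync_channel({cap})"), || {
            let (tx, rx) = mpsc::sync_channel(cap);
            thread::spawn(move || {
                for i in 0..N {
                    tx.send(i).unwrap();
                }
            });
            rx
        });
    }
}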

@cehteh
Author

cehteh commented Jun 3, 2022 via email

@Architector4

Architector4 commented Oct 22, 2022

Experiencing the exact same issue, but with --exec. I'm trying to run fd "png$" / --exec optipng -o 7 (I know that this usage of optipng may break stuff, I know what I'm doing).

I've tried leaving different variations running on my laptop overnight, finding it absolutely frozen every time I wake up, and arrived at fd "png$" --threads=1 --exec sh -c 'optipng -o 7 "{}"; sync; sleep 1m' to try to minimize the impact. At this point I've checked with a task manager, and it is indeed fd that's ballooning.

Currently on fd 8.4.0 from Arch Linux repositories. Is there anything I can do on my end to do what I want, besides resorting to find?

@tmccombs
Collaborator

@Architector4 you might have some luck filtering out some directories. It looks like you are running on /, so excluding /proc, /sys, /tmp, etc. would likely make a difference (if you are on macOS those paths might be different). If the files you care about are all on the same filesystem, you could try the --one-file-system option.

@Architector4

Architector4 commented Oct 23, 2022

@tmccombs I don't think it would make any meaningful difference in my case -- I'm looking for all files that end with "png", of which there would not be any in such special directories. I imagine (I hope?) that fd does not put files not matching the search terms into that ballooning cache.

I'd also prefer to go through all filesystems (my SSD mounted on / and my HDD mounted in /media), so I don't want to use that option either. But yeah, those are good ways to help alleviate the problem for other cases where applicable.

In my case I decided to resort to find and have had good success with it. This is what I ended up running to achieve the same task without running out of memory, with GNU find and parallel:

find / -iname "*.png" -print0 | parallel -0 optipng -o 7 {}

@tmccombs
Collaborator

Oh, I think you're right. In your case, what is probably happening is that the queue is filling up because running optipng blocks the processing of additional entries, and there isn't any backpressure to stop new entries from being added. If you were to pipe fd to parallel, my guess is it would probably work better. What we really need here is a bounded queue. The question is how to do that without hurting performance in cases where backpressure isn't needed.

@Architector4

Architector4 commented Oct 24, 2022

I feel like the searching thread's performance would be I/O bound most of the time (input being filesystem traversal, output going to --exec or stdout or whatnot), so some simple "if over 10000 items in the queue, block adding to it until the next item is consumed" check slapped right before the "add to queue" part would probably not introduce too much of a performance drawback.
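That check is essentially a counting semaphore in front of the queue; a hand-rolled sketch (illustrative only, not fd's code) might look like this, with acquire() called right before enqueueing and release() after an item is consumed.

use std::sync::{Condvar, Mutex};

// A counting gate: acquire() blocks once `limit` items are in flight.
struct Gate {
    count: Mutex<usize>,
    limit: usize,
    cond: Condvar,
}

impl Gate {
    fn new(limit: usize) -> Self {
        Gate { count: Mutex::new(0), limit, cond: Condvar::new() }
    }

    // Producer side: call right before "add to queue".
    fn acquire(&self) {
        let mut count = self.count.lock().unwrap();
        while *count >= self.limit {
            count = self.cond.wait(count).unwrap();
        }
        *count += 1;
    }

    // Consumer side: call after handling an item.
    fn release(&self) {
        *self.count.lock().unwrap() -= 1;
        self.cond.notify_one();
    }
}

fn main() {
    let gate = Gate::new(10_000);
    gate.acquire(); // before enqueueing an item
    gate.release(); // after the consumer handles it
}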

In any case, this feels like a necessary feature for the well-being of the host machine, because this excessive memory usage seems to apply to any case where the output is slower than the input. To demonstrate, I have a silly script, named delaystream.sh:

#!/bin/sh
while read -r line; do
	echo "$line"
	sleep 1
done

This script prints each line it receives on stdin to stdout, but waits 1 second between them, causing artificially slow I/O. So I piped fd to it:

fd . / | ./delaystream.sh

This reliably makes fd balloon in memory usage.

I argue that some kind of limit is absolutely necessary here, and that it is more important than the performance gained by not having to account for it.

Maybe it would be possible to introduce a command-line switch that disables the limit for those who want the performance benefits and know that ballooning won't happen?

@tavianator
Collaborator

I feel like the searching thread's performance would be I/O bound most of the time (input being filesystem traversal, output going to --exec or stdout or whatnot), so some simple "if over 10000 items in the queue, block adding to it until the next item is consumed" check slapped right before the "add to queue" part would probably not introduce too much of a performance drawback.

Ideally yes, but: #918 (comment)

In any case, this feels like a necessary feature for wellbeing of the host machine

I agree, we need to do something. I can try switching to crossbeam-channel again.

@Architector4

Oh, sorry, missed that.

Would it be possible to slap something like Either<channel, sync_channel> into the code and operate on whichever one it is, have the user pick between them via command-line switches, and default to the bounded queue for safety, or at least let the user pick the bounded one if they wish? This feels like an acceptable strategy until a better solution is found, and it at least makes it possible to use fd in use cases like mine.
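A sketch of what that could look like (hypothetical; not what fd ultimately merged): a small enum over the two sender types, selected by a flag such as a hypothetical --unbounded-queue switch.

use std::sync::mpsc::{self, Receiver, SendError, Sender, SyncSender};

// A sender that is either bounded (backpressure) or unbounded (max speed).
enum MaybeBoundedSender<T> {
    Bounded(SyncSender<T>),
    Unbounded(Sender<T>),
}

impl<T> MaybeBoundedSender<T> {
    fn send(&self, value: T) -> Result<(), SendError<T>> {
        match self {
            // Blocks when the buffer is full.
            MaybeBoundedSender::Bounded(tx) => tx.send(value),
            // Never blocks; the queue can grow without limit.
            MaybeBoundedSender::Unbounded(tx) => tx.send(value),
        }
    }
}

// Pick the variant from the (hypothetical) command-line flag.
fn make_channel<T>(unbounded: bool, cap: usize) -> (MaybeBoundedSender<T>, Receiver<T>) {
    if unbounded {
        let (tx, rx) = mpsc::channel();
        (MaybeBoundedSender::Unbounded(tx), rx)
    } else {
        let (tx, rx) = mpsc::sync_channel(cap);
        (MaybeBoundedSender::Bounded(tx), rx)
    }
}

fn main() {
    let (tx, rx) = make_channel::<u32>(false, 4096);
    tx.send(42).unwrap();
    drop(tx);
    println!("got {}", rx.recv().unwrap());
}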

tavianator added a commit to tavianator/fd that referenced this issue Oct 24, 2022
tavianator added a commit to tavianator/fd that referenced this issue Nov 1, 2022
tavianator added a commit to tavianator/fd that referenced this issue Nov 1, 2022
sharkdp closed this as completed in 5278405 on Nov 1, 2022
tavianator added a commit to tavianator/fd that referenced this issue Oct 30, 2023
We originally switched to bounded channels for backpressure to fix sharkdp#918.
However, bounded channels have a significant initialization overhead as
they pre-allocate a fixed-size buffer for the messages.

This implementation uses a different backpressure strategy: each thread
gets a limited-size pool of WorkerResults.  When the size limit is hit,
the sender thread has to wait for the receiver thread to handle a result
from that pool and recycle it.

Inspired by [snmalloc], results are recycled by sending the boxed result
over a channel back to the thread that allocated it.  By allocating and
freeing each WorkerResult from the same thread, allocator contention is
reduced dramatically.  And since we now pass results by pointer instead
of by value, message passing overhead is reduced as well.

Fixes sharkdp#1408.

[snmalloc]: https://github.com/microsoft/snmalloc
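A stripped-down sketch of that recycling scheme (illustrative only; the real implementation lives in the commit above, and sending the recycle handle with every message is a simplification): results flow to the receiver over one channel, emptied boxes flow back over a per-producer channel, and a producer with POOL_SIZE results in flight blocks until one comes back.

use std::sync::mpsc;
use std::thread;

struct WorkerResult {
    path: String,
}

const POOL_SIZE: usize = 4; // per-thread cap on in-flight results

fn main() {
    // Results from all producer threads to the single receiver thread.
    let (result_tx, result_rx) =
        mpsc::channel::<(Box<WorkerResult>, mpsc::Sender<Box<WorkerResult>>)>();

    let mut workers = Vec::new();
    for t in 0..2 {
        let result_tx = result_tx.clone();
        workers.push(thread::spawn(move || {
            // Per-thread recycling channel: the receiver returns used boxes here.
            let (recycle_tx, recycle_rx) = mpsc::channel::<Box<WorkerResult>>();
            let mut allocated = 0;

            for i in 0..10 {
                // Reuse a recycled box if one is available; otherwise allocate,
                // but only up to POOL_SIZE. Past that, block until the receiver
                // hands a box back: this blocking is the backpressure.
                let mut result = if allocated < POOL_SIZE {
                    match recycle_rx.try_recv() {
                        Ok(b) => b,
                        Err(_) => {
                            allocated += 1;
                            Box::new(WorkerResult { path: String::new() })
                        }
                    }
                } else {
                    recycle_rx.recv().unwrap()
                };
                result.path = format!("thread-{t}/file-{i}");
                result_tx.send((result, recycle_tx.clone())).unwrap();
            }
        }));
    }
    drop(result_tx);

    // Receiver: handle each result, then send the box back for reuse.
    for (result, recycle) in result_rx {
        println!("{}", result.path);
        let _ = recycle.send(result); // ignore the error if the worker is done
    }
    for w in workers {
        w.join().unwrap();
    }
}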