The benchmarks are misleading #31

Closed
lilianmoraru opened this issue Jun 7, 2017 · 16 comments

Comments

@lilianmoraru

I think the README should specifically mention that it is the regex search that is faster (because regex matching is slow in find).
The plain search is slower, yet the README seems to imply that the search in general is faster.
Here is a more realistic usage of find, on an SSD (all runs are on a warm cache):

time find -name '*.cpp'
...
real	0m0.447s
user	0m0.044s
sys	0m0.220s

And because fd uses patterns, it would look like this:

time fd '.*\.cpp$'
...
real	0m2.353s
user	0m0.100s
sys	0m1.240s

Weird, I just tested (while writing this) with -iregex and got this:

time find -iregex '.*\.cpp$'
...
real	0m0.667s
user	0m0.128s
sys	0m0.400s
@BurntSushi

Unfortunately, neither your comment nor the README provides a way to easily run and confirm the benchmark for yourself, so it's pretty hard to make any kind of progress.

I'd encourage you to experiment with disabling gitignore support, which I believe is the -I flag.

@lilianmoraru
Author

Writing to stdout seems slow:

$ time fd '.*\.cpp$' | wc -l
10863

real	0m0.462s
user	0m0.268s
sys	0m0.180s

--

$ time find -iregex '.*\.cpp$' | wc -l
10863

real	0m0.335s
user	0m0.136s
sys	0m0.192s

@BurntSushi

Also, the README's benchmark is clearly running with different flags than what you've provided. :-)

@lilianmoraru
Author

I agree, but it implies that the search in general is faster (it also mentions that the comparison is fair).
I think the claims are valid for regex search only, which would be OK if mentioned.

@BurntSushi

@lilianmoraru Did you experiment with the -I flag? What did you discover?

@lilianmoraru
Author

No difference (this source tree does not have a .gitignore):

$ time fd -I '.*\.cpp$' | wc -l
10863

real	0m0.477s
user	0m0.112s
sys	0m0.384s

@BurntSushi

I think @lilianmoraru and @sharkdp need to focus on finding a common set of files that they can each benchmark and verify. There are too many variables at play to immediately blame the regex engine. @lilianmoraru Even in your own benchmark, find with -iregex is faster than fd, so clearly there is more to the story.

@lilianmoraru
Author

If I use "-n" it has almost the same performance characteristics as find -iregex.

@sharkdp
Owner

sharkdp commented Jun 7, 2017

@lilianmoraru

I think the README should specifically mention that it is the regex search that is faster (because regex matching is slow in find).

I think you are right: it seems that the -iregex search in find is at least part of the reason why find was slower in the particular benchmark that I ran in my home folder.

The plain search is slower, yet the README seems to imply that the search in general is faster.

I think this will really depend on the specific situation, as @BurntSushi mentioned:

I think @lilianmoraru and @sharkdp need to focus on finding a common set of files that they can each benchmark and verify.

Absolutely. I honestly did not expect this to become this popular that fast, so the current benchmark was really just a first shot for me to get a feeling for the performance.

@lilianmoraru

Writing to stdout seems slow

Yes, please pipe the output to /dev/null like in the README, or at least turn on -n/--no-color for fd. Otherwise fd might be slowed down by the terminal rendering.

[..] it implies that the search in general is faster (it also mentions that the comparison is fair).

The README says: "The given options for fd are needed for a fair comparison". The options are --hidden (search through hidden folders), --no-ignore (do not respect ignore files) and --full-path (search the whole path, not just file and directory names). I turned these options on to get a 'fair' comparison, because find does all these things by default (full-path search only for -iregex). Without these options, fd is much faster:

> time fd '.*[0-9]\.jpg$' > /dev/null
fd '.*[0-9]\.jpg$' > /dev/null  0,33s user 0,22s system 99% cpu 0,555 total

> time find -iregex '.*[0-9]\.jpg$' > /dev/null
find -iregex '.*[0-9]\.jpg$' > /dev/null  4,38s user 0,90s system 99% cpu 5,298 total

Coming back to your original point, you are right in that find seems to be much faster when using -iname instead of -iregex:

> time find -iname '*[0-9].jpg' > /dev/null
find -iname '*[0-9].jpg' > /dev/null  1,78s user 0,93s system 99% cpu 2,715 total
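
To illustrate the full-path point above: -iname tests only the last path component, while -iregex is matched against the whole path. A minimal sketch using find alone, on a made-up scratch directory (the names photos, a1.jpg and b2.jpg are purely illustrative and not part of the benchmark):

```shell
# Hypothetical scratch tree, created only for this illustration.
dir=$(mktemp -d)
mkdir -p "$dir/photos"
touch "$dir/photos/a1.jpg" "$dir/b2.jpg"

# -iname looks at the last path component only: just the directory ./photos matches.
by_name=$(cd "$dir" && find . -iname '*photos*' | wc -l)

# -iregex looks at the whole path: ./photos and ./photos/a1.jpg both match.
by_regex=$(cd "$dir" && find . -iregex '.*photos.*' | wc -l)

echo "-iname: $by_name match(es), -iregex: $by_regex match(es)"
rm -rf "$dir"
```

This is the reason --full-path is switched on for fd when it is compared against find -iregex.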

I think the arguments are valid for regex search only, which would be ok if mentioned.

Agreed.

I suggest the following:

  • Specifically mention the -iregex option in the README
  • Work on (several) reproducible benchmarks. Also, do statistics (I've started using bench)
  • Keep improving fd's performance 😃
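
For the "do statistics" item, a rough sketch of what summarizing repeated runs could look like without any extra tooling. The helper name summarize is made up for this example; it only assumes a POSIX awk and one timing per input line, and it is not what bench does internally:

```shell
# Hypothetical helper: print the count, mean and sample standard deviation
# of the numbers read from stdin (one per line, blank lines ignored).
# LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
summarize() {
  LC_ALL=C awk '
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) exit 1
      mean = sum / n
      var = (n > 1) ? (sumsq - n * mean * mean) / (n - 1) : 0
      if (var < 0) var = 0   # guard against rounding noise
      printf "n=%d mean=%.3f stddev=%.3f\n", n, mean, sqrt(var)
    }'
}

# Placeholder input, not real measurements.
printf '%s\n' 1 2 3 | summarize   # prints: n=3 mean=2.000 stddev=1.000
```

Any source of per-run timings can be piped into it; a dedicated tool such as bench additionally reports confidence intervals and outlier effects, as in the output below.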

As a first version of a reproducible benchmark (suggesting that find -iname is slightly faster than fd), clone https://github.com/rust-lang/rust and run:

> bench "fd -HI '\.py$'" "find -iname '*.py'" "find -iregex '.*\.py$'"
benchmarking bench/fd -HI '\.py$'
time                 22.98 ms   (22.63 ms .. 23.21 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 23.40 ms   (23.16 ms .. 23.76 ms)
std dev              655.0 μs   (496.3 μs .. 867.7 μs)

benchmarking bench/find -iname '*.py'
time                 17.78 ms   (17.39 ms .. 18.12 ms)
                     0.996 R²   (0.991 R² .. 0.999 R²)
mean                 18.10 ms   (17.87 ms .. 18.46 ms)
std dev              730.5 μs   (460.3 μs .. 1.151 ms)
variance introduced by outliers: 12% (moderately inflated)

benchmarking bench/find -iregex '.*\.py$'
time                 29.63 ms   (29.19 ms .. 30.04 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 29.52 ms   (29.33 ms .. 29.77 ms)
std dev              448.1 μs   (315.7 μs .. 678.3 μs)

@lilianmoraru
Author

Btw, if you want to bench the Rust code (and use the nightly bench; for example, rayon puts the benches in a separate workspace project), you also have this option: https://github.com/BurntSushi/cargo-benchcmp.

Side-note:
I like how you can do this:
For a file StuffAndStuff.txt, you can just write 'fd and' and it will find it, while doing something like find -iregex "And" of course doesn't work...
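
A small sketch of why the bare pattern fails with find: -iregex has to match the entire path, so a substring search needs '.*' on both sides. The scratch directory and the file other.txt are made up for the illustration:

```shell
# Hypothetical scratch directory containing the file from the example above.
dir=$(mktemp -d)
touch "$dir/StuffAndStuff.txt" "$dir/other.txt"

# A bare "And" would have to equal the whole path, so nothing is found.
bare=$(cd "$dir" && find . -iregex 'And' | wc -l)

# Wrapping the pattern in '.*' turns it into a substring search.
wrapped=$(cd "$dir" && find . -iregex '.*and.*' | wc -l)

echo "bare pattern: $bare match(es), wrapped pattern: $wrapped match(es)"
rm -rf "$dir"
```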

@BurntSushi

Also, consider adding a larger repository to your benchmark. :-) A couple dozen milliseconds is frighteningly fast---probably in "process overhead" territory. (Of course, that is also important to benchmark!)
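
One way to get a feeling for that process overhead is to time a command that does nothing and compare it with the numbers above. This is only a sketch: the helper name avg_us is made up, it assumes GNU date (whose %N prints nanoseconds), and the loop itself adds a little overhead of its own:

```shell
# Hypothetical helper: average wall-clock microseconds per run of a command,
# with its output discarded so that terminal rendering is not measured.
# Assumes GNU date, where %N expands to nanoseconds.
avg_us() {
  runs=$1; shift
  start=$(date +%s%N)
  i=0
  while [ "$i" -lt "$runs" ]; do
    "$@" > /dev/null 2>&1
    i=$((i + 1))
  done
  end=$(date +%s%N)
  echo $(( (end - start) / runs / 1000 ))
}

# "env true" forces a real process to be spawned instead of the shell builtin.
baseline=$(avg_us 20 env true)
echo "process startup baseline: ${baseline} us per run"
```

Calling it the same way on a real search, for example with find . -iname '*.py' as the command, gives a number that can be set against this baseline.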

@lilianmoraru
Author

It seems that it is enough to make the regex a bit more complicated for find to become slower:

time fd --hidden --no-ignore --full-path -n hello | wc -l
79945
fd --hidden --no-ignore --full-path -n hello  1,32s user 0,65s system 109% cpu 1,810 total
wc -l  0,16s user 0,05s system 11% cpu 1,810 total

--------

time find -iregex ".*[Hh][Ee][Ll][Ll][Oo].*" | wc -l
79945
find -iregex ".*[Hh][Ee][Ll][Ll][Oo].*"  2,37s user 0,53s system 108% cpu 2,664 total
wc -l  0,01s user 0,00s system 0% cpu 2,664 total

@sharkdp
Owner

sharkdp commented Jun 7, 2017

Also, consider adding a larger repository to your benchmark. :-) A couple dozen milliseconds is frighteningly fast---probably in "process overhead" territory. (Of course, that is also important to benchmark!)

Yes, thanks. It looks like those results are similar for larger folders, though (fd being 30%-50% slower than find -iname) -- for this particular search pattern.

@lilianmoraru
Author

lilianmoraru commented Jun 7, 2017

@sharkdp Well, I came to the conclusion that it actually performs pretty darn well.
If you disable coloring and .gitignore handling and run a very basic search (fd vs find -name '*.cpp'), fd is a few tens of milliseconds slower, but in the remaining cases it is faster (again, without coloring, which would be the fair way to compare).
So it is fast, just not with coloring (actually it is fast with coloring too, you don't wait long, but when comparing milliseconds/seconds with find...) and not for a very basic search where you can use -name.

sharkdp added a commit that referenced this issue Jun 9, 2017
@sharkdp
Owner

sharkdp commented Jun 9, 2017

The -iregex flag is now mentioned in the README. I'm going to close this and open a new ticket for reproducible benchmarks.

@sharkdp sharkdp closed this as completed Jun 9, 2017
@sharkdp
Owner

sharkdp commented Oct 8, 2017

Just in case anyone comes back to this ticket: This all happened before parallel search was implemented (#41). fd has become much faster since then and is typically also faster than find -iname.
