Extend and finish automated benchmark scripts #131

Closed
sahib opened this Issue May 11, 2015 · 11 comments

@sahib
Owner

sahib commented May 11, 2015

Benchmark scripts are at:

tests/test_speed/*

Current State

  • benchmark.py builds and installs all competitors in /tmp.
  • Each tool is run x times with flushed caches and one additional time with
    warm caches (a rough sketch of the loop follows below). The tools are run
    on generated or real datasets.
  • A benchmark.json file is written containing all relevant measurements.
  • plot_benchmark.py creates a nice svg bar plot out of the json file.
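
For orientation, the measurement loop boils down to something like the following sketch. This is purely hypothetical — the function names, run counts and JSON layout are illustrative and need not match the real benchmark.py:

# Hypothetical sketch of the measurement loop -- names and structure
# are illustrative and do not mirror the actual benchmark.py.
import json
import subprocess
import time

def flush_caches():
    # Placeholder; the real flush needs root, see the note further down.
    subprocess.check_call(['sync'])

def run_once(cmd):
    # Time one run of a competitor by wall clock.
    start = time.monotonic()
    subprocess.check_call(cmd)
    return time.monotonic() - start

def benchmark(tools, cold_runs=3):
    results = {}
    for name, cmd in tools.items():
        timings = []
        for _ in range(cold_runs):
            flush_caches()
            timings.append(run_once(cmd))
        timings.append(run_once(cmd))  # one extra run with warm caches
        results[name] = timings
    with open('benchmark.json', 'w') as f:
        json.dump(results, f, indent=2)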

Done:

  • Move install scripts into separate bash scripts.

ToDo:

  • Write docs/benchmark.rst -- done.
  • Generate some more realistic datasets, especially ones where rmlint
    shines (e.g. directory trees spread over different disks).
  • Provide baseline timing (time it takes to read all found duplicates from
    disk).
  • Extract the number of files each tool found. If the numbers do not match
    up, give a warning.
  • Write a merge script that combines several benchmark.json files and draws
    a line plot of how every tool performed on each individual dataset (a
    rough sketch follows below).
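
The merge part could be as small as the following sketch (hypothetical naming, keyed by input file name rather than whatever the final format ends up being); the line plot would then be fed from the merged structure:

# Hypothetical merge sketch, not the eventual merge script: collect
# several benchmark.json files into one structure keyed by file name.
import json
import sys

def merge(paths):
    merged = {}
    for path in paths:
        with open(path) as f:
            merged[path] = json.load(f)
    return merged

if __name__ == '__main__':
    with open('benchmark-merged.json', 'w') as out:
        json.dump(merge(sys.argv[1:]), out, indent=2)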

@sahib sahib added this to the 2.3.0 milestone May 11, 2015

@sahib sahib added the ready label Aug 24, 2015

@sahib

Owner

sahib commented Sep 14, 2015

Mostly done. What's new:

  • Both benchmark.py and plot_benchmark.py understand a rich set of options.
  • Installation scripts are (dumb) little bash scripts now.
  • Added a baseline program (a simple Python dupe finder) that serves as the
    slowest competitor (sketched after this list).
  • Parse the number of found dupes and duplicate sets for each program (the
    numbers tend not to match up well between tools).
  • Peak memory usage is also included, but not for each run.
  • Plots are generated for timing, CPU usage and dupes found.
  • Results are written to timestamped directories to avoid confusion.
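
To give an idea of what "simple" means for the baseline, it follows roughly this pattern — group files by size, then compare full-file hashes. This is only a sketch of the idea, not the actual baseline program:

# Rough sketch of a deliberately naive dupe finder (not the actual
# baseline script): group files by size, then by a full-file hash.
import hashlib
import os
from collections import defaultdict

def find_dupes(root):
    by_size = defaultdict(list)
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                by_size[os.path.getsize(path)].append(path)

    dupes = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 20), b''):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
        dupes.extend(group for group in by_hash.values() if len(group) > 1)
    return dupes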

Sample usage:

$ sudo python benchmark.py -r -d /usr -d /tmp -p dupd  # produces directory with json files.
$ python plot_benchmark.py bench-output-2015-09-14T22:12:55+0200 # produces dir with svgs.
$ chromium plot-output-2015-09-14T22:12:55+0200/*.svg

sudo is still needed in order to flush the filesystem cache.
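
For reference, the flush is the usual Linux page-cache drop, which only root can trigger (the exact call in benchmark.py may look slightly different):

import subprocess

def drop_caches():
    # Write out dirty pages first, then drop page cache, dentries and inodes.
    # /proc/sys/vm/drop_caches is only writable by root, hence the sudo.
    subprocess.check_call(['sync'])
    with open('/proc/sys/vm/drop_caches', 'w') as f:
        f.write('3\n')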

@sahib sahib added In Progress and removed ready labels Sep 14, 2015

@sahib

Owner

sahib commented Sep 15, 2015

Here are some first "results":

http://rmlint.readthedocs.org/en/latest/benchmarks.html

It's not perfect yet, but it shows the direction. Click on the images for tooltips and hover effects.

@0ion9

0ion9 commented Sep 16, 2015

Looking at the results page, I think that rmlint-spooky is getting the wrong color. In the graph, there is a dark grey bar where I would expect spooky to show, but in the legend it is shown as an orangeish color very similar to baseline.py. To me it looks like either the legend or the graph is wrong.

@SeeSpotRun

Contributor

SeeSpotRun commented Sep 16, 2015

Personally I think the runs apart from the first are pretty meaningless. In the real world, the user should only need to run the tool once, and it's unlikely that many of the files are cached before they start.

Tests such as http://www.virkki.com/jyri/articles/index.php/duplicate-file-detection-performance/ are pretty meaningless ("first I ran it once and ignored the time, just to populate file caches"). With that said, dupd is impressively fast on cached comparisons.

Rmlint's file reading strategy deliberately doesn't try to read 2 files from the same disk at the same time, in order to reduce disk seeks / thrash. I think this probably penalises rmlint on runs where file data is already cached.
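
To illustrate the principle (this is just a sketch, not rmlint's actual scheduler): files are grouped by the device they live on, each device is read strictly one file at a time, and only different devices are read in parallel.

# Illustrative sketch of per-disk serialized reading: never read two
# files from the same device concurrently, but read devices in parallel.
import os
import threading
from collections import defaultdict

def read_file(path, chunk_size=1 << 20):
    with open(path, 'rb') as f:
        while f.read(chunk_size):
            pass

def read_per_device(paths):
    by_device = defaultdict(list)
    for path in paths:
        by_device[os.stat(path).st_dev].append(path)

    def worker(files):
        for path in files:   # strictly one file at a time per device
            read_file(path)

    threads = [threading.Thread(target=worker, args=(files,))
               for files in by_device.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()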

@sahib

Owner

sahib commented Sep 16, 2015

@0ion9: Yup, that seems to be a rendering bug in pygal, it works fine with the default style applied.

@SeeSpotRun: That's exactly what's missing: some text that tries to explain the results. But for that one might need the full picture, even if it's just there to say "As you can see...".

It should also be noted that rmlint can use --replay from the second run onwards, which makes the cost of re-running mostly negligible.

Another slightly weird result is that -a spooky is not really useful, but that might be partly due to my amd64 CPU. Still, using another cheap hash function for -PP might be a good idea.

@SeeSpotRun

Contributor

SeeSpotRun commented Sep 16, 2015

SeeSpotRun@a40f970 should give a bit of a speedup. I had a sort() call in hasher.c which was supposed to ensure the hasher gets the least busy hashing thread, but it turns out the sort time was costing more than the gain.

@sahib

Owner

sahib commented Sep 16, 2015

Cool, that chopped off a few seconds on the re-runs.
New plots will follow in a few days; I'll collect some other changes first.

@sahib

Owner

sahib commented Sep 18, 2015

I added some text to explain some aspects of the plots.
Maybe someone can validate it; it was late and beer was involved.

@SeeSpotRun

Contributor

SeeSpotRun commented Sep 26, 2015

I've started building a synthetic set of test files to help reconcile the number of duplicates found. By having 1/2/4/8/16 etc. sets of duplicates of different types we can quickly infer what each duplicate finder is/isn't finding. I'm trying to cover all the things that a duplicate finder might miss (intentionally or due to a bug):

  • Duplicate files >4GB (max 32 bit unsigned int)
  • Duplicate files >2GB (max 32 bit signed int)
  • Hidden files
  • Files in hidden folders
  • Symlinked files
  • Files in symlinked folders
  • Hardlinked files
  • Files that you have to cross a filesystem boundary to reach
  • Files with weird characters in their name
  • "Normal" files

Edit:

  • Zero-byte files

It also covers potential false positives:

  • Files >4GB that only differ in the last byte
  • Files >2GB that only differ in the last byte
  • Bind-mounts such that the same file appears under two paths (but is not a duplicate)

Edit:

  • Hash collisions for common checksum types

... looking for suggestions on what else to include.

@sahib

Owner

sahib commented Sep 26, 2015

  • Files with spaces, tabs, newlines, single quotes, double quotes and commas (you know why 😏) in their names (re: weird names).
  • Files where only one byte is different (off-by-one errors).

Will you provide a set of sh files to generate those datasets? (Some may be harder, like the filesystem boundaries.)
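
Just to sketch what a generator for the simpler cases could look like (purely hypothetical names and payloads, not the agreed-upon set; the large-file, bind-mount and cross-filesystem cases obviously need more machinery):

# Hypothetical dataset-generator sketch covering a few easy cases.
import os

def make(path, data=b'payload'):
    with open(path, 'wb') as f:
        f.write(data)

def generate(root):
    os.makedirs(root, exist_ok=True)
    make(os.path.join(root, 'normal_a'))
    make(os.path.join(root, 'normal_b'))                     # plain duplicate
    make(os.path.join(root, '.hidden_dupe'))                 # hidden file
    os.link(os.path.join(root, 'normal_a'),
            os.path.join(root, 'hardlink_a'))                # hardlinked file
    os.symlink('normal_a', os.path.join(root, 'symlink_a'))  # symlinked file
    make(os.path.join(root, 'zero_a'), b'')                  # zero-byte files
    make(os.path.join(root, 'zero_b'), b'')
    make(os.path.join(root, "weird 'name'\twith\nstuff,"))   # weird characters
    # Same size as the duplicates, differs only in the last byte
    # (false-positive bait / off-by-one check):
    make(os.path.join(root, 'off_by_one'), b'payloae')

if __name__ == '__main__':
    generate('testdata')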

@SeeSpotRun

Contributor

SeeSpotRun commented Sep 26, 2015
