some simple benchmarking of sourmash gather on GTDB zipfiles/SBTs #1530

Closed
ctb opened this issue May 17, 2021 · 6 comments

ctb commented May 17, 2021

While writing a blog post about the sourmash v4.1 release, I got curious about the practical implications of --linear/--no-linear and --prefetch/--no-prefetch, so I ran a benchmark script and recorded the output; the script and raw output are at the bottom.

The query signature here was a merge of four signatures that were present in the database, so gather would do four iterations.

Summary:

Zipfile collection:

|                       | Time (s) | Memory (MB) |
|-----------------------|----------|-------------|
| no-linear/prefetch    | 207      | 81          |
| linear/prefetch       | 205      | 81          |
| no-linear/no-prefetch | 811      | 87          |
| linear/no-prefetch    | 802      | 86          |

Indexed zipfile (SBT):

|                       | Time (s) | Memory (MB) |
|-----------------------|----------|-------------|
| no-linear/prefetch    | 10       | 215         |
| linear/prefetch       | 177      | 1502        |
| no-linear/no-prefetch | 22       | 214         |
| linear/no-prefetch    | 187      | 1505        |

conclusions

so I think I understand almost everything here, which is good, since I wrote a lot of the code 😆 -

  • for the zipfile collection, four passes were needed when prefetch wasn't used, so --no-prefetch took ~4x the time;
  • for the zipfile collection, --no-linear and --linear are identical (there's no index to take advantage of);
  • for the SBT zip, --linear is way slower than using the index, of course!
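The pass-counting behind the first bullet can be sketched with a toy model (mine, not sourmash's actual implementation): gather is iterative, and each round subtracts the best match's hashes from the query. Without prefetch, every round rescans the whole database; with prefetch, a single first pass collects the overlapping candidates and later rounds only search those.

```python
def gather(query, database, prefetch=True):
    """Toy model of iterative gather: query and database entries are sets of hashes."""
    passes_over_db = 0
    matches = []
    candidates = None

    if prefetch:
        # one pass to collect every signature that overlaps the query at all
        passes_over_db += 1
        candidates = [s for s in database if query & s]

    remaining = set(query)
    while remaining:
        # without prefetch, each round is a full scan of the database
        pool = candidates if prefetch else database
        if not prefetch:
            passes_over_db += 1
        best = max(pool, key=lambda s: len(remaining & s), default=None)
        if best is None or not (remaining & best):
            break
        matches.append(best)
        remaining -= best  # subtract the match and repeat
    return matches, passes_over_db
```

With a query that merges four disjoint signatures, `prefetch=False` does four database passes while `prefetch=True` does one, matching the ~4x slowdown in the zipfile table.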

but the two weird results are for the SBT:

  • linear/no-prefetch and linear/prefetch take almost the same time? no-prefetch should require multiple passes... maybe all the signatures are loaded into memory on the first pass, so later passes are cheap?
  • and why is memory usage so much higher with --linear than with --no-linear?

so my hypothesis (a theory we can test! :dora:) is that the SBT .signatures() method is keeping all the sigs in memory. The puzzling thing is that the memory usage is so high for that - maybe it's keeping the tree in memory, too, or something?
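If that hypothesis is right, the difference comes down to eager vs. lazy iteration. A minimal sketch (toy names, not the sourmash API) of why a list-building `.signatures()` would blow up memory during a linear scan, while a generator keeps only one item live at a time:

```python
import tracemalloc

def sigs_as_list(n, size):
    # eager: every "signature" is resident at once
    return [bytes(size) for _ in range(n)]

def sigs_as_generator(n, size):
    # lazy: one "signature" live at a time
    for _ in range(n):
        yield bytes(size)

def peak_kb(make_iter):
    """Peak traced allocation (KB) while consuming the iterable."""
    tracemalloc.start()
    for _ in make_iter():
        pass
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak // 1024
```

Consuming 1000 fake 10 KB signatures peaks around 10 MB with the list but only a few KB with the generator, which is the shape of the --linear vs --no-linear memory gap above.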

Anyway, the big conclusions are the obvious ones, and they reflect sourmash's defaults:

  • --no-linear --prefetch (the default) is generally best;
  • if you want low memory, use zipfile collections; if you want speed, use an indexed database.

script and raw output

# bench.sh
set -x
set -e
# all four combinations with a zipfile (no index)
/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.zip

# all four combinations with an SBT (indexed)
/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic-reps.k31.sbt.zip

Raw output attached.

bench.txt
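As an aside, the summary tables above can be extracted from the raw `/usr/bin/time -v` output mechanically; here's a small helper for that (mine, not part of the benchmark script):

```python
import re

def parse_time_v(text):
    """Pull wall-clock time and peak RSS out of `/usr/bin/time -v` output.

    Returns (elapsed string, peak RSS in MB)."""
    elapsed = re.search(r"Elapsed \(wall clock\) time.*: (\S+)", text).group(1)
    rss_kb = int(re.search(r"Maximum resident set size \(kbytes\): (\d+)", text).group(1))
    return elapsed, rss_kb / 1024
```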

ctb commented May 18, 2021

on the full SBT:

% /usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
       User time (seconds): 63.27
        System time (seconds): 15.34
        Percent of CPU this job got: 20%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 6:14.66
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 1020764
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 21
        Minor (reclaiming a frame) page faults: 614443
        Voluntary context switches: 45296
        Involuntary context switches: 735935
        Swaps: 0
        File system inputs: 5477584
        File system outputs: 129016
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

so ~60 seconds of CPU time (6m 15s wall clock at only 20% CPU, so mostly I/O) and ~1 GB of memory to search 300k signatures?!

ctb commented May 18, 2021

full genomic SBT is 15 GB for the 300k sigs.

ctb commented May 18, 2021

ran:

/usr/bin/time -v sourmash gather --no-linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --no-linear --no-prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip
/usr/bin/time -v sourmash gather --linear --no-prefetch out.sig gtdb-rs202.genomic.k31.sbt.zip

This is on ~280k sigs, all of GTDB.

ctb commented May 18, 2021

Results on Really Big files (15 GB .sbt.zip for ~280k all-GTDB)

|                       | Time    | Memory |
|-----------------------|---------|--------|
| no-linear/prefetch    | 4m 56s  | 1 GB   |
| linear/prefetch       | 20m     | 8.8 GB |
| no-linear/no-prefetch | 1h 16m  | 1 GB   |
| linear/no-prefetch    | 20m 33s | 8.8 GB |

ctb commented May 21, 2021

I'll label this with FAQ and leave this here.

ctb commented May 4, 2022

Integrated into docs with #2025.

ctb closed this as completed May 4, 2022