
benchmarks for different database formats. #1958

Closed
ctb opened this issue Apr 17, 2022 · 2 comments
ctb commented Apr 17, 2022

note: these were calculated with sourmash 4.3.

these should probably go somewhere close to the database pages, perhaps when we put out a new release - ref #1941


note, we have benchmarks on metagenomes against full genbank here.

benchmarks - prefetch

gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=1000

query was SRR606249 / podar-ref.

query scaled=10,000

863 matches total;
53.9k query hashes, 19.0k found in matches above threshold. The sqldb here was produced via `sourmash sig flatten $zip -o $sqldb`, and the sbt.zip via `sourmash index $sbt $zip`.

| db format | db size | time | memory |
| --- | --- | --- | --- |
| sqldb | 15 GB | 28.2s | 2.6 GB |
| sbt | 3.5 GB | 2m 43s | 2.9 GB |
| zip | 1.7 GB | 5m 16s | 1.9 GB |

query scaled=1000

625 matches total;
374.6k query hashes, 189.1k found in matches above threshold

| db format | db size | time | memory |
| --- | --- | --- | --- |
| sqldb | 15 GB | 3m 58s | 9.9 GB |
| sbt | 3.5 GB | 7m 33s | 2.6 GB |
| zip | 1.7 GB | 5m 53s | 2.0 GB |
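The difference in query hash counts between the two runs (374.6k at scaled=1000 vs 53.9k at scaled=10,000) falls out of how FracMinHash downsampling works: a sketch keeps only hashes below a fixed fraction of the hash space, so it retains roughly a 1/scaled fraction of all hashes. A minimal sketch of the idea (illustrative only, not sourmash's actual implementation; `downsample` is a made-up helper):

```python
import random

# FracMinHash idea (illustrative, not sourmash's actual code): keep a hash h
# iff h falls below max_hash_space / scaled, so a sketch retains roughly a
# 1/scaled fraction of all hashes, and raising scaled only ever drops hashes.
MAX_HASH = 2**64

def downsample(hashes, scaled):
    """Keep only hashes under the FracMinHash threshold for `scaled`."""
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}

random.seed(42)
all_hashes = {random.randrange(MAX_HASH) for _ in range(1_000_000)}

at_1000 = downsample(all_hashes, 1_000)     # ~1/1000 of the hashes
at_10000 = downsample(all_hashes, 10_000)   # ~1/10,000 of the hashes

# Downsampling is nested: the scaled=10,000 sketch is a subset of scaled=1000.
assert at_10000 <= at_1000
```

This is also why downsampling a query on the fly is cheap: it's a filter over existing hashes, not a recomputation from sequence.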

thoughts

I was surprised by SBT being slower than the others, since it's pretty fast on simpler (single-genome) queries. I think it reflects a few different things - but is mostly about the complex query, along with how much faster everything else has become. (I ran them twice to make sure the numbers were legit!)

this is all single threaded; once we get multithreaded/rust-based searching of zip files in, zipfile search is gonna be smokin'.

sqldb is showing its value esp with higher scaled - since the scaled value is used as a constraint directly in the SQL query, we're searching a much smaller space of hashes. I was surprised to see the high memory usage, and it might be worth revisiting the code to see if that's coming from choices made in Python land (likely) or if that's internal to sqlite.
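The "scaled as a SQL constraint" point can be illustrated with a toy schema (the table and column names below are invented for illustration, not sourmash's actual schema): because only hashes below the FracMinHash cutoff can ever match a scaled query, the cutoff becomes a plain `WHERE` clause, and with an index on the hash column the excluded rows are never touched.

```python
import random
import sqlite3

# Toy illustration (made-up schema, not sourmash's): the scaled cutoff
# becomes a WHERE clause, so higher scaled values scan far fewer rows.
# SQLite INTEGERs are signed 64-bit, so use a 63-bit hash space here.
MAX_HASH = 2**63

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (hash INTEGER, sketch_id INTEGER)")
conn.execute("CREATE INDEX hash_idx ON hashes (hash)")

random.seed(0)
rows = [(random.randrange(MAX_HASH), 1) for _ in range(100_000)]
conn.executemany("INSERT INTO hashes VALUES (?, ?)", rows)

def rows_visible_at(scaled):
    """Count the rows a query at this scaled value could ever match."""
    cutoff = MAX_HASH // scaled
    (n,) = conn.execute(
        "SELECT count(*) FROM hashes WHERE hash < ?", (cutoff,)
    ).fetchone()
    return n

# A scaled=10,000 query searches ~10x fewer rows than a scaled=1000 one.
print(rows_visible_at(1_000), rows_visible_at(10_000))
```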

the extra on-disk size for sqldb is because the sqldb implementation has a lot of indices and doesn't seem to compress anything. I don't think we'll be distributing sqlite databases via download anytime soon 😆

benchmarks - LCA summarize

gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=10,000

command:

sourmash lca summarize --query $QUERY \
    --db $DB -o lca.bench.$DB.csv

query scaled=10,000

53.9k query hashes

| lca db format | db size | time | memory |
| --- | --- | --- | --- |
| SQL | 1.6 GB | 20s | 380 MB |
| JSON | 175 MB | 1m 21s | 6.2 GB |

thoughts

the SQL LCA db is fast! and low memory!

as I wrote elsewhere, LCA-style queries into sqlite databases are one of the real pitches for #1808 - SqliteIndex itself is a nice proof of concept, but not compelling from a performance/disk space perspective. A fast on-disk approach will be nice! (SqliteCollectionManifest is also fantastic, FWIW.)

it's interesting to see the low memory for SQL here compared to the prefetch benchmarks. Makes me think that I'm doing something bad with memory in the SqliteIndex.find(...) code 🤔 .
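Part of the JSON/SQL memory gap is structural: a JSON LCA database has to be fully materialized in RAM by `json.load()` before the first lookup, while SQLite pages rows in from disk on demand. A toy contrast (illustrative only; the table and column names are invented, not sourmash's actual layout):

```python
import json
import os
import sqlite3
import tempfile

# Fake LCA-ish mapping: hash -> lineage string (invented data, illustration only).
mapping = {str(h): f"lineage-{h % 5}" for h in range(50_000)}

# JSON route: the entire mapping lives in memory after load, even for one lookup.
json_path = tempfile.mktemp(suffix=".json")
with open(json_path, "w") as fp:
    json.dump(mapping, fp)
with open(json_path) as fp:
    in_ram = json.load(fp)          # full materialization
json_answer = in_ram["12345"]

# SQLite route: only the pages holding the matched row are read from disk.
db_path = tempfile.mktemp(suffix=".sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE lca (hash TEXT PRIMARY KEY, lineage TEXT)")
conn.executemany("INSERT INTO lca VALUES (?, ?)", mapping.items())
conn.commit()
(sql_answer,) = conn.execute(
    "SELECT lineage FROM lca WHERE hash = ?", ("12345",)
).fetchone()

assert json_answer == sql_answer
conn.close()
os.remove(json_path)
os.remove(db_path)
```

Nothing about SQLite forces low memory, though - results accumulated on the Python side still pile up, which would be consistent with the prefetch numbers above.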

ctb commented Apr 29, 2022

older gather benchmarks with a range of options: #1530

ctb commented May 4, 2022

closing in favor of #2014
