
benchmarks for different database formats. #1958

Closed
ctb opened this issue Apr 17, 2022 · 2 comments
ctb commented Apr 17, 2022

note: these were calculated with sourmash 4.3.

these should probably go somewhere close to the database pages, perhaps when we put out a new release - ref #1941


note, we have benchmarks on metagenomes against full genbank here.

benchmarks - prefetch

gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=1000

query was SRR606249 / podar-ref.

query scaled=10,000

863 matches total;
53.9k query hashes, 19.0k found in matches above threshold. The sqldb here was produced via `sourmash sig flatten $zip -o $sqldb`, and the sbt.zip via `sourmash index $sbt $zip`.

| db format | db size | time | memory |
| --- | --- | --- | --- |
| sqldb | 15 GB | 28.2s | 2.6 GB |
| sbt | 3.5 GB | 2m 43s | 2.9 GB |
| zip | 1.7 GB | 5m 16s | 1.9 GB |

query scaled=1000

625 matches total;
374.6k query hashes, 189.1k found in matches above threshold

| db format | db size | time | memory |
| --- | --- | --- | --- |
| sqldb | 15 GB | 3m 58s | 9.9 GB |
| sbt | 3.5 GB | 7m 33s | 2.6 GB |
| zip | 1.7 GB | 5m 53s | 2.0 GB |
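The difference in query hash counts between the two runs (374.6k at scaled=1000 vs 53.9k at scaled=10,000) falls out of how FracMinHash downsampling works: a sketch keeps only hashes below a fixed fraction of the hash space, so it retains roughly a 1/scaled fraction of all hashes. A minimal sketch of the idea (illustrative only, not sourmash's actual implementation; `downsample` is a made-up helper):

```python
import random

# FracMinHash idea (illustrative, not sourmash's actual code): keep a hash h
# iff h falls below max_hash_space / scaled, so a sketch retains roughly a
# 1/scaled fraction of all hashes, and raising scaled only ever drops hashes.
MAX_HASH = 2**64

def downsample(hashes, scaled):
    """Keep only hashes under the FracMinHash threshold for `scaled`."""
    threshold = MAX_HASH // scaled
    return {h for h in hashes if h < threshold}

random.seed(42)
all_hashes = {random.randrange(MAX_HASH) for _ in range(1_000_000)}

at_1000 = downsample(all_hashes, 1_000)     # ~1/1000 of the hashes
at_10000 = downsample(all_hashes, 10_000)   # ~1/10,000 of the hashes

# Downsampling is nested: the scaled=10,000 sketch is a subset of scaled=1000.
assert at_10000 <= at_1000
```

This is also why downsampling a query on the fly is cheap: it's a filter over existing hashes, not a recomputation from sequence.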

thoughts

I was surprised by SBT being slower than the others, since it's pretty fast on simpler (single-genome) queries. I think it reflects a few different things - but is mostly about the complex query, along with how much faster everything else has become. (I ran them twice to make sure the numbers were legit!)

this is all single threaded; once we get multithreaded/rust-based searching of zip files in, zipfile search is gonna be smokin'.

sqldb is showing its value esp with higher scaled - since the scaled value is used as a constraint directly in the SQL query, we're searching a much smaller space of hashes. I was surprised to see the high memory usage, and it might be worth revisiting the code to see if that's coming from choices made in Python land (likely) or if that's internal to sqlite.
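The "scaled as a SQL constraint" point can be illustrated with a toy schema (the table and column names below are invented for illustration, not sourmash's actual schema): because only hashes below the FracMinHash cutoff can ever match a scaled query, the cutoff becomes a plain `WHERE` clause, and with an index on the hash column the excluded rows are never touched.

```python
import random
import sqlite3

# Toy illustration (made-up schema, not sourmash's): the scaled cutoff
# becomes a WHERE clause, so higher scaled values scan far fewer rows.
# SQLite INTEGERs are signed 64-bit, so use a 63-bit hash space here.
MAX_HASH = 2**63

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (hash INTEGER, sketch_id INTEGER)")
conn.execute("CREATE INDEX hash_idx ON hashes (hash)")

random.seed(0)
rows = [(random.randrange(MAX_HASH), 1) for _ in range(100_000)]
conn.executemany("INSERT INTO hashes VALUES (?, ?)", rows)

def rows_visible_at(scaled):
    """Count the rows a query at this scaled value could ever match."""
    cutoff = MAX_HASH // scaled
    (n,) = conn.execute(
        "SELECT count(*) FROM hashes WHERE hash < ?", (cutoff,)
    ).fetchone()
    return n

# A scaled=10,000 query searches ~10x fewer rows than a scaled=1000 one.
print(rows_visible_at(1_000), rows_visible_at(10_000))
```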

the extra on-disk size for sqldb is because the sqldb implementation has a lot of indices and doesn't seem to compress anything. I don't think we'll be distributing sqlite databases via download anytime soon 😆

benchmarks - LCA summarize

gtdb genomic reps, 65k sigs, DNA, ksize=31, db scaled=10,000

command:

sourmash lca summarize --query $QUERY \
    --db $DB -o lca.bench.$DB.csv

query scaled=10,000

53.9k query hashes

| lca db format | db size | time | memory |
| --- | --- | --- | --- |
| SQL | 1.6 GB | 20s | 380 MB |
| JSON | 175 MB | 1m 21s | 6.2 GB |

thoughts

the SQL LCA db is fast! and low memory!

as I wrote elsewhere, LCA-style queries into sqlite databases are one of the real pitches for #1808 - SqliteIndex itself is a nice proof of concept, but not compelling from a performance/disk space perspective. A fast on-disk approach will be nice! (SqliteCollectionManifest is also fantastic, FWIW.)

it's interesting to see the low memory for SQL here compared to the prefetch benchmarks. Makes me think that I'm doing something bad with memory in the SqliteIndex.find(...) code 🤔 .
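Part of the JSON/SQL memory gap is structural: a JSON LCA database has to be fully materialized in RAM by `json.load()` before the first lookup, while SQLite pages rows in from disk on demand. A toy contrast (illustrative only; the table and column names are invented, not sourmash's actual layout):

```python
import json
import os
import sqlite3
import tempfile

# Fake LCA-ish mapping: hash -> lineage string (invented data, illustration only).
mapping = {str(h): f"lineage-{h % 5}" for h in range(50_000)}

# JSON route: the entire mapping lives in memory after load, even for one lookup.
json_path = tempfile.mktemp(suffix=".json")
with open(json_path, "w") as fp:
    json.dump(mapping, fp)
with open(json_path) as fp:
    in_ram = json.load(fp)          # full materialization
json_answer = in_ram["12345"]

# SQLite route: only the pages holding the matched row are read from disk.
db_path = tempfile.mktemp(suffix=".sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE lca (hash TEXT PRIMARY KEY, lineage TEXT)")
conn.executemany("INSERT INTO lca VALUES (?, ?)", mapping.items())
conn.commit()
(sql_answer,) = conn.execute(
    "SELECT lineage FROM lca WHERE hash = ?", ("12345",)
).fetchone()

assert json_answer == sql_answer
conn.close()
os.remove(json_path)
os.remove(db_path)
```

Nothing about SQLite forces low memory, though - results accumulated on the Python side still pile up, which would be consistent with the prefetch numbers above.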

ctb commented Apr 29, 2022

older gather benchmarks with a range of options: #1530

ctb commented May 4, 2022

closing in favor of #2014
