how might we distribute "diff" or patch databases? #985

Open
ctb opened this issue May 8, 2020 · 8 comments
Labels
eoss4 Tagged for consideration for EOSS grant.

Comments


ctb commented May 8, 2020

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

random thoughts -

  • updating taxonomy is no problem if we can override taxonomy per #969 (comment) (adding lineage manipulation & taxonomy reporting in more places in sourmash) - just provide an updated lineage db. they're small.
  • updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs!
  • right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported." (a rough sketch of the idea follows this list)
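
For concreteness, here is a minimal sketch of what such a screen could look like on the Python side, assuming a plain text file of md5sums and result objects that expose the matched signature. None of this is an existing sourmash API; the function names and the `.signature` attribute are made up for illustration.

```python
# Hypothetical sketch only -- not an existing sourmash API.
# Assumes a plain text screen file with one md5sum per line ('#' starts a comment)
# and result objects that expose the matched signature via `.signature`.

def load_screen_list(path):
    """Return the set of md5sums that should never be reported."""
    screened = set()
    with open(path) as fp:
        for line in fp:
            md5 = line.split('#', 1)[0].strip()
            if md5:
                screened.add(md5)
    return screened

def screen_results(results, screened_md5s):
    """Drop any match whose signature md5sum is on the screen list."""
    return [r for r in results if r.signature.md5sum() not in screened_md5s]
```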

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

luizirber commented:

> re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

I think we should use the multi-DB capabilities in search/gather to:

  • release a full build every 2-3 months
  • release diffs each week, with only new genomes added
    Not sure how to name them, but <year>.<week-of-the-year>.sbt.zip might work for the latter (see the sketch below for how a client might assemble the list of databases to search).
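
To make the naming concrete, here is a small sketch of how a client might assemble the list of databases to pass to a single run under this scheme. The `full-<year>.<week>.sbt.zip` name for full builds and the same-year assumption are inventions of the sketch, not anything sourmash does today.

```python
# Sketch of the proposed naming scheme; nothing here exists in sourmash itself.
# Assumes full builds named full-<year>.<week>.sbt.zip and weekly diff DBs named
# <year>.<week-of-the-year>.sbt.zip, as floated above; the full build and the
# diffs are assumed to fall in the same calendar year to keep the sketch short.
import datetime

def databases_to_search(full_build_week, today=None):
    """Return the full build plus every weekly diff released after it."""
    today = today or datetime.date.today()
    year, week, _ = today.isocalendar()
    full_year, full_week = full_build_week
    dbs = [f"full-{full_year}.{full_week:02d}.sbt.zip"]
    dbs += [f"{year}.{w:02d}.sbt.zip" for w in range(full_week + 1, week + 1)]
    return dbs

# e.g. a full build from week 19 of 2020, queried during week 24:
print(databases_to_search((2020, 19), datetime.date(2020, 6, 12)))
# ['full-2020.19.sbt.zip', '2020.20.sbt.zip', '2020.21.sbt.zip',
#  '2020.22.sbt.zip', '2020.23.sbt.zip', '2020.24.sbt.zip']
```

Since search/gather already accept multiple databases, the full build and all of the diffs could then be handed to one invocation.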

Taxonomy-wise, include the taxinfo in full builds, but for each week provide updated taxinfo covering the full build + that week. I think this avoids having to dig into every single DB for taxonomy information. And since taxonomy is updated more frequently (the signature/original dataset never changes for a specific version, but the taxonomic assignment DOES change), this allows more accurate results; for example, older gather CSVs can be updated with newer tax assignments without having to re-run gather.

> updating taxonomy is no problem if we can override taxonomy per #969 (comment) - just provide an updated lineage db. they're small.

I think we should continue reporting the dataset ID (be it GCA for genbank, GCF for refseq, or similarly for GTDB) in the sourmash outputs, and then provide the taxinfo with the mapping to the lineage (connected to the point above, about updating old results without having to re-run sourmash).

> updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs!
> right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."

These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip?

Note: why would we want to remove signatures? We always provide the latest version of a genome, so would we need to remove the old one? Can genbank/refseq submissions be retracted?

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

Incoming brain dump!

#456 is a fairly old PR, and I don't even know how to properly rebase it for today's codebase, but most of it migrated to other PRs.

One thing still left (and connected to this issue) is the prepare command. The idea is to take an index description (a .sbt.json file) and prepare a local copy for usage. There is a test showing how to use a .sbt.json with IPFS as storage and load it into an FSStorage (hidden dir) locally. The _fill_up/repair work comes into play because the IPFS .sbt.json can be leaf-only, and during prepare the steps would be (see the sketch after this list):

  • Download all leaves (potentially in parallel)
  • Run _fill_internal, which creates all the internal nodes
  • Save to a new local SBT
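
Roughly, that flow might look like the following in Python. `SBT.load`, `FSStorage`, and `_fill_internal` are the names used in this discussion, but the exact call signatures below are assumptions and may not match the current codebase.

```python
# Rough sketch of the proposed `prepare` step; this is not an existing sourmash
# command, and the call signatures are assumptions based on the discussion above.
from sourmash.sbt import SBT
from sourmash.sbt_storage import FSStorage

def prepare(sbt_json, local_dir, name):
    # load the leaf-only index description; leaves still point at the IPFS storage
    tree = SBT.load(sbt_json)

    # 1. download all leaves (this loop could run in parallel)
    for leaf in tree.leaves():
        _ = leaf.data  # accessing .data pulls the signature from the remote storage

    # 2. build all the internal nodes from the downloaded leaves
    tree._fill_internal()

    # 3. save a new local SBT backed by an FSStorage (hidden dir)
    storage = FSStorage(local_dir, f".sbt.{name}")
    tree.save(name, storage=storage)
```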

There are a bunch of optimizations that can be done to avoid consuming too much memory:

  • As leaves are downloaded, save them to the Storage (they won't change)
  • For the internal level right above the leaves, build the internal node if all leaves are available, save it to storage, and unload it (and the leaves under it)
  • When root is reached, save the index description
    This also fits well with the zipped SBTs.

So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description plus instructions to run prepare before using a DB. This is less convenient than wget/curl-ing a DB, but if we are providing frequent updates it is simply unsustainable to keep all that (redundant) data available permanently. Unless we find some sort of funding/sponsorship for it...


ctb commented May 10, 2020

wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :).

re

> Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search.

Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!

luizirber commented:

Feedback from personal comm:

> Anything dead simple to retrieve and use. FTP is blocked at XXX and other institutions, which I never would've believed when I was previously in academia. Even fetching rust libraries was blocked here.


ctb commented Apr 21, 2021

#1477 could add support for "masking" arbitrary signatures from search and gather.


ctb commented Apr 23, 2021

see also #433


ctb commented Jul 6, 2021

a few quick thoughts -

  • picklist include and exclude can be used by pipelines to include only the updated signatures, as well as to exclude signatures/databases that have already been searched (a small sketch follows this list)
  • while 'gather' results cannot easily be updated to reflect new databases, prefetch results can be, and that then allows more efficient updating of gather.
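
As a small illustration of the first point, a pipeline could keep a running picklist of what it has already searched. The single `md5` column below is just an assumption for the sketch, not a required picklist format.

```python
# Sketch: write a picklist-style CSV of signature md5s a pipeline has already
# searched, so later runs can exclude them. The single "md5" column is an
# assumption for illustration, not a required sourmash format.
import csv

def write_exclude_picklist(already_searched_md5s, path="already_searched.csv"):
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["md5"])
        for md5 in sorted(already_searched_md5s):
            writer.writerow([md5])
```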


ctb commented Mar 12, 2022

this is a fascinating situation where we could actually use manifests. just thinking out loud:

my first (bad) idea is that we could simply edit manifests, since (as noted in #1849) there are situations where they don't necessarily contain all signatures, anyway.

a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.

a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).

a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases. not sure how to best indicate which signature to ignore - md5 + identifier, maybe?

the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
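
A sketch of how the fourth idea could be consumed, assuming manifests are CSVs with an md5 column and that newer databases gain a hypothetical 'deprecates' column; neither the column name nor these helpers exist in sourmash.

```python
# Sketch of idea four: manifests of newer databases carry a hypothetical
# "deprecates" column holding the md5 of a signature in an older database
# that should now be ignored. Column names are assumptions, not the real
# sourmash manifest format.
import csv

def collect_deprecated(new_manifest_paths):
    """Gather every md5 that any newer manifest marks as deprecated."""
    deprecated = set()
    for path in new_manifest_paths:
        with open(path, newline="") as fp:
            for row in csv.DictReader(fp):
                if row.get("deprecates"):
                    deprecated.add(row["deprecates"])
    return deprecated

def filter_old_manifest(old_manifest_path, deprecated):
    """Yield rows from an older manifest, skipping deprecated signatures."""
    with open(old_manifest_path, newline="") as fp:
        for row in csv.DictReader(fp):
            if row.get("md5") not in deprecated:
                yield row
```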


ctb commented Jun 16, 2022

keyword search bait: database updates, update databases, incremental database updates
