how might we distribute "diff" or patch databases? #985

Open
ctb opened this issue May 8, 2020 · 8 comments
Labels
eoss4 Tagged for consideration for EOSS grant.

Comments


ctb commented May 8, 2020

re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

random thoughts -

  • updating taxonomy is no problem if we can override taxonomy per #969 (comment) (adding lineage manipulation & taxonomy reporting in more places in sourmash) - just provide an updated lineage db. they're small.
  • updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs!
  • right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported." (a rough sketch of the idea follows this list)
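
For concreteness, here is a minimal sketch of what such a screen could look like on the Python side, assuming a plain text file of md5sums and result objects that expose the matched signature. None of this is an existing sourmash API; the function names and the `.signature` attribute are made up for illustration.

```python
# Hypothetical sketch only -- not an existing sourmash API.
# Assumes a plain text screen file with one md5sum per line ('#' starts a comment)
# and result objects that expose the matched signature via `.signature`.

def load_screen_list(path):
    """Return the set of md5sums that should never be reported."""
    screened = set()
    with open(path) as fp:
        for line in fp:
            md5 = line.split('#', 1)[0].strip()
            if md5:
                screened.add(md5)
    return screened

def screen_results(results, screened_md5s):
    """Drop any match whose signature md5sum is on the screen list."""
    return [r for r in results if r.signature.md5sum() not in screened_md5s]
```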

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

luizirber commented:

> re #970, how might we move towards a model where we regularly (weekly? monthly?) release genbank/refseq databases that take into account new and revised genomes?

I think we should use the multi-DB capabilities in search/gather to:

  • release a full build every 2-3 months
  • release diffs each week, with only new genomes added
    Not sure how to name them, but <year>.<week-of-the-year>.sbt.zip might work for the latter (see the sketch below for how a client might assemble the list of databases to search).
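
To make the naming concrete, here is a small sketch of how a client might assemble the list of databases to pass to a single run under this scheme. The `full-<year>.<week>.sbt.zip` name for full builds and the same-year assumption are inventions of the sketch, not anything sourmash does today.

```python
# Sketch of the proposed naming scheme; nothing here exists in sourmash itself.
# Assumes full builds named full-<year>.<week>.sbt.zip and weekly diff DBs named
# <year>.<week-of-the-year>.sbt.zip, as floated above; the full build and the
# diffs are assumed to fall in the same calendar year to keep the sketch short.
import datetime

def databases_to_search(full_build_week, today=None):
    """Return the full build plus every weekly diff released after it."""
    today = today or datetime.date.today()
    year, week, _ = today.isocalendar()
    full_year, full_week = full_build_week
    dbs = [f"full-{full_year}.{full_week:02d}.sbt.zip"]
    dbs += [f"{year}.{w:02d}.sbt.zip" for w in range(full_week + 1, week + 1)]
    return dbs

# e.g. a full build from week 19 of 2020, queried during week 24:
print(databases_to_search((2020, 19), datetime.date(2020, 6, 12)))
# ['full-2020.19.sbt.zip', '2020.20.sbt.zip', '2020.21.sbt.zip',
#  '2020.22.sbt.zip', '2020.23.sbt.zip', '2020.24.sbt.zip']
```

Since search/gather already accept multiple databases, the full build and all of the diffs could then be handed to one invocation.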

Taxonomy-wise, include the taxinfo in full builds, but for each week provide updated taxinfo covering the full build + that week. I think this avoids having to dig into every single DB for taxonomy information. And since taxonomy is updated more frequently (the signature/original dataset never changes for a specific version, but the taxonomic assignment DOES change), this allows more accurate results; for example, older gather CSVs can be updated with newer tax assignments without having to re-run gather.

> updating taxonomy is no problem if we can override taxonomy per #969 (comment) - just provide an updated lineage db. they're small.

I think we should continue reporting the dataset ID (be it GCA for genbank, GCF for refseq, or similarly for GTDB) in the sourmash outputs, and then provide the taxinfo with the mapping to the lineage (connected to the point above, about updating old results without having to re-run sourmash).

> updating LCA and SBTs is more of a problem. even if we could remove signatures from them, I would prefer not to ask people to re-download massive dbs!
> right now we don't have the ability to "screen" signatures from searches in those databases. I suspect this wouldn't be too hard to add, perhaps via a selector API or something else. basically, add a way to specify that "this signature, as identified by md5sum, should never be reported."

These two go together, I suspect: we can remove files from full builds, and maybe provide the 'screen' in weekly builds (as an additional file in the DB?) to indicate what matches to skip?

Note: why would we want to remove signatures? We always provide the latest version of a genome, so would we need to remove the old one? Can genbank/refseq submissions be retracted?

this could tie into some of the work that @luizirber is doing with IPFS, I suspect.

Incoming brain dump!

#456 is a fairly old PR, and I don't even know how to properly rebase it for today's codebase, but most of it migrated to other PRs.

One thing still left (and connected to this issue) is the prepare command. The idea is to take an index description (a .sbt.json file) and prepare a local copy for usage. There is a test showing how to use a .sbt.json with IPFS as storage and load it into an FSStorage (hidden dir) locally. The _fill_up/repair work comes into play because the IPFS .sbt.json can be leaf-only, and during prepare the steps would be (see the sketch after this list):

  • Download all leaves (potentially in parallel)
  • Run _fill_internal, which creates all the internal nodes
  • Save to a new local SBT
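
Roughly, that flow might look like the following in Python. `SBT.load`, `FSStorage`, and `_fill_internal` are the names used in this discussion, but the exact call signatures below are assumptions and may not match the current codebase.

```python
# Rough sketch of the proposed `prepare` step; this is not an existing sourmash
# command, and the call signatures are assumptions based on the discussion above.
from sourmash.sbt import SBT
from sourmash.sbt_storage import FSStorage

def prepare(sbt_json, local_dir, name):
    # load the leaf-only index description; leaves still point at the IPFS storage
    tree = SBT.load(sbt_json)

    # 1. download all leaves (this loop could run in parallel)
    for leaf in tree.leaves():
        _ = leaf.data  # accessing .data pulls the signature from the remote storage

    # 2. build all the internal nodes from the downloaded leaves
    tree._fill_internal()

    # 3. save a new local SBT backed by an FSStorage (hidden dir)
    storage = FSStorage(local_dir, f".sbt.{name}")
    tree.save(name, storage=storage)
```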

There are a bunch of optimizations that can be done to avoid consuming too much memory:

  • As leaves are downloaded, save them to the Storage (they won't change)
  • For the internal level right above the leaves, build the internal node if all leaves are available, save it to storage, and unload it (and the leaves under it)
  • When root is reached, save the index description
    This also fits well with the zipped SBTs.

So: I think this connects with IPFS and this issue because, instead of providing full ZIP files, we could provide only the description plus instructions to run prepare before using a DB. This is less convenient than wget/curl-ing a DB, but if we are providing frequent updates it is simply unsustainable to keep all that (redundant) data available permanently. Unless we find some sort of funding/sponsorship for it...


ctb commented May 10, 2020

wow, that went in a direction all right. Not sure how to respond to the IPFS stuff, have to re-read that or maybe brainstorm in person :).

re

> Note: Why would we want to remove signatures? We always provide the latest version of a genome, and so we need to remove the old one? Can genbank/refseq submission be retracted?

yes, some genomes are just broken and get removed or deprecated, and I don't think they should be available for search.

Note, for the genomeRxiv work, we will face similar questions of how to provide regular database updates. Since we should have actual funding for that, maybe that's a place to dig in!

luizirber commented:

Feedback from personal comm:

> Anything dead simple to retrieve and use. FTP is blocked at XXX and other institutions, which I never would've believed when I was previously in academia. Even fetching rust libraries was blocked here.


ctb commented Apr 21, 2021

#1477 could add support for "masking" arbitrary signatures from search and gather.


ctb commented Apr 23, 2021

see also #433


ctb commented Jul 6, 2021

a few quick thoughts -

  • picklist include and exclude can be used by pipelines to include only the updated signatures, as well as to exclude signatures/databases that have already been searched (a small sketch follows this list)
  • while 'gather' results cannot easily be updated to reflect new databases, prefetch results can be, and that then allows more efficient updating of gather.
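
As a small illustration of the first point, a pipeline could keep a running picklist of what it has already searched. The single `md5` column below is just an assumption for the sketch, not a required picklist format.

```python
# Sketch: write a picklist-style CSV of signature md5s a pipeline has already
# searched, so later runs can exclude them. The single "md5" column is an
# assumption for illustration, not a required sourmash format.
import csv

def write_exclude_picklist(already_searched_md5s, path="already_searched.csv"):
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["md5"])
        for md5 in sorted(already_searched_md5s):
            writer.writerow([md5])
```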


ctb commented Mar 12, 2022

this is a fascinating situation where we could actually use manifests. just thinking out loud:

my first (bad) idea is that we could simply edit manifests, since (as noted in #1849) there are situations where they don't necessarily contain all signatures, anyway.

a second (better?) idea is to add a 'deprecated' field that marks the signature as something to ignore.

a third (maybe actually good?) idea is to add a 'deprecated by' column that points at another signature (maybe an md5?).

a fourth (also maybe actually good) idea is to add a 'deprecates' column in database manifests that would support ignoring signatures in older databases. not sure how to best indicate which signature to ignore - md5 + identifier, maybe?

the first three ideas all involve modifying old databases. boo. the fourth only involves modifying new databases.
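
A sketch of how the fourth idea could be consumed, assuming manifests are CSVs with an md5 column and that newer databases gain a hypothetical 'deprecates' column; neither the column name nor these helpers exist in sourmash.

```python
# Sketch of idea four: manifests of newer databases carry a hypothetical
# "deprecates" column holding the md5 of a signature in an older database
# that should now be ignored. Column names are assumptions, not the real
# sourmash manifest format.
import csv

def collect_deprecated(new_manifest_paths):
    """Gather every md5 that any newer manifest marks as deprecated."""
    deprecated = set()
    for path in new_manifest_paths:
        with open(path, newline="") as fp:
            for row in csv.DictReader(fp):
                if row.get("deprecates"):
                    deprecated.add(row["deprecates"])
    return deprecated

def filter_old_manifest(old_manifest_path, deprecated):
    """Yield rows from an older manifest, skipping deprecated signatures."""
    with open(old_manifest_path, newline="") as fp:
        for row in csv.DictReader(fp):
            if row.get("md5") not in deprecated:
                yield row
```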


ctb commented Jun 16, 2022

keyword search bait: database updates, update databases, incremental database updates
