Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support/provide/require manifests for collections of signatures? #1352

Closed
ctb opened this issue Feb 28, 2021 · 13 comments
Closed

support/provide/require manifests for collections of signatures? #1352

ctb opened this issue Feb 28, 2021 · 13 comments
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Copy link
Contributor

ctb commented Feb 28, 2021

In the ZipFileLinearIndex PR #1349 (comment), I'm thinking about signature manifests that provide some kind of metadata about the signature - ksize, moltype, tags, etc.

This could be used by selectors #1072 to avoid loading signatures that don't match the selector #1097 (comment).

The upsides to manifests are -

The downsides are -

  • another layer of messiness in terms of keeping information up to date, i.e. what if the contents of a signature file changes? this could be somewhat mitigated by supporting some kind of content-addressable storage and/or md5sum and/or file size and/or modification data.
  • we would want to avoid performance hits (such as the current performance hit involved in loading a large .sbt.json file before we proceed to mostly ignore it and iterate over all the signatures, as in greyhound)

My current take is that it would be good to support but not require this. We could prototype it for internal use for some of our larger databases, in particular, and start to build out the commands to create, store, and update manifests.

I guess this could then lead to a gradiation of collection/index storages:

  • level 0, random collection of files, gotta traverse and load them all to figure out if they're correct
  • level 1, partial/incomplete/untrusted manifest allowing ignoring of some of the signatures based on characteristics; this might be something where after a full traversal, a manifest is generated automatically for some cases (like zip files and directory indexes). note, this is actually be a pretty good use case for zip files, which can store things like manifests alongside signatures, unlike .sig files.
  • level 2, contents completely managed by sourmash, manifest is completely trustworthy (e.g. LCA/revindex databases, or SBTs)
@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

would be nice to support manifests that are part of the collection, e.g.

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

information that could be part of manifests --

  • signature md5sum
  • name
  • filename
  • ksize
  • moltype
  • scaled/num
  • track_abundance

we'd want to figure out the signature vs sketch abstractions properly per #616.

So, probably, two tables at least -

  • signature - contains SourmashSignature information
  • sketch - contains MinHash information

and the sketch would then contain a pointer to the actual hashes.

@ctb
Copy link
Contributor Author

ctb commented Feb 28, 2021

an example of "optional and useful manifests that are not SBTs" - find signatures in SBTs by name #933 and more generally a sourmash sig grep command #1075

@ctb
Copy link
Contributor Author

ctb commented Mar 4, 2021

the most useful manifest would be one(s) that supported all of the selector options so that no signatures would need to be loaded to select them

@ctb
Copy link
Contributor Author

ctb commented Mar 6, 2021

I'm not quite sure how to name things, but wrt #1365, an alternate kind of manifest would be one that is a list of signatures to load from elsewhere. This is the "shopping cart" version of manifests, vs the "catalog" kind of manifest.

This is something that future prefetch/search/etc could then output a la #1365.

p.s. I kind of like the name 'picklist'.

@luizirber
Copy link
Member

I still feel like the index description can serve as the manifest for a collection of signature. In this sense, a LinearIndex would sort of fit, with a signatures field containing paths to signatures or another structure with more info (a la SigLeaf/SigStore #1097).

The benefit on this is that a LinearIndex turns into a base index that can be upgraded to an SBT or LCA/RevIndex/Greyhound as needed. In the LCA case the signatures are implicit and need to be rebuild from the index, but for the LinearIndex/SBT case they can be stored with the same layout (including into storages, if the new SBT format from #1201 is eventually merged, since it separates internal nodes and signatures in the storage).

@ctb
Copy link
Contributor Author

ctb commented Mar 11, 2021

I think we are converging! I'm not 100% sure what the code would look like (I usually only know after writing it and then refactoring it a few times :) and am also curious about how #1374 fits in, since this is/was needed to provided location info instead of modifying signatures(...).

(As an alternative, I was thinking about adding a get_signature_location to LinearIndex and the others but then decided we needed to think it through more - a la SigLeaf / SigStore, as you say.)

@luizirber
Copy link
Member

Quick comment on supporting multiple indexes per storage: during #648 I added the list_sbts method to ZipStorage with a trivial implementation, but having in mind that in the future it would be useful to have a similar method in Storage to be able to pull the indices descriptions and be able to select the desired one. Current impl just returns the first .sbt.json file it finds, but we can store multiple indices in one storage and share the signatures/ between them, for example.

@ctb
Copy link
Contributor Author

ctb commented Jun 13, 2021

in re

  • another layer of messiness in terms of keeping information up to date, i.e. what if the contents of a signature file changes? this could be somewhat mitigated by supporting some kind of content-addressable storage and/or md5sum and/or file size and/or modification data.
  • we would want to avoid performance hits (such as the current performance hit involved in loading a large .sbt.json file before we proceed to mostly ignore it and iterate over all the signatures, as in greyhound)

my experiences with ZipFileLinearIndex over in #1589 have convinced me that the ZipFile.infolist() method is fast enough that we can easily sync manifests, and more generally that techniques based around using infolist() will be reasonably performant.

It also seems reasonable for manifests to be an "expert level, use with caution" feature that we primarily use on large databases to support better performance. This is as opposed to picklists, which will be directly user-facing and should be robust in terms of reporting misuse or mistakes.

@ctb
Copy link
Contributor Author

ctb commented Jun 13, 2021

Quick comment on supporting multiple indexes per storage: during #648 I added the list_sbts method to ZipStorage with a trivial implementation, but having in mind that in the future it would be useful to have a similar method in Storage to be able to pull the indices descriptions and be able to select the desired one. Current impl just returns the first .sbt.json file it finds, but we can store multiple indices in one storage and share the signatures/ between them, for example.

Yep, I like this!

In #1590, I add (also, trivial :) support for signature manifests to Zipfile collections. While the implementation is not very clean or generic, I think we can evolve Storage classes a little bit to support the relevant functionality - we'd want some functionality like,

  • Storage.get_manifest(...) - loads exactly one manifest file
  • Index.signatures_with_internal(...), the current method for getting all signatures in a location in order to build a manifest.

In the short term, supporting manifests on SBTs would be pretty easy as long as we don't try to allow multiple indices within an SBT. I would probably start by supporting separate manifest creation, storage, and loading as per the ZipFileLinearIndex implementation in #1590. This would give us good picklist functionality with SBT.signatures(...) which is very useful in the short term for supporting charcoal and grist use cases.

In the longer-term, some questions will arise -

  • do we want to have separate SBT indices and manifests in a single file? I think the answer is "yes" because that way we can support full signature files in SBTs with Save full signature file in the SBT leaves? #198, and multiple indices on those signature files, all within a single zip file. (Which is good, right?)
    • if we want to support multiple SBT indices and want to keep them separate from a manifest, then I think we must forego using the SBT index itself as a manifest, because it will not necessarily have all the signatures in it and also that gets confusing.
    • if we don't want to support multiple SBT indices, we could upgrade the SBT index to support all the information that needs to be in the manifest, e.g. ksize and moltype (SBT JSON file should contain ksize and moltype #63).

@ctb
Copy link
Contributor Author

ctb commented Jun 13, 2021

hmm, maybe we want to start thinking about using Storage for arbitrary collections of signatures, with manifests; and Index for indexed collections that support search/gather/prefetch?

Then the .signatures() method (and related methods) are defined on Storage, while .find(...) is provided for Index classes, and things like SBTs and LCA DBs support both.

Could also add a Storage.indexes(...) that returns all indexes (currently just SBTs?) in any given storage.

Then manifests belong to Storage, and can be used efficiently with picklists; while Indexes are good for search, but don't work efficiently with picklists.

@ctb
Copy link
Contributor Author

ctb commented Jun 19, 2021

discussed in #1599 and connected links.

@ctb ctb added the revisit_me An issue that needs attention and clarification label Jun 26, 2021
@ctb
Copy link
Contributor Author

ctb commented Mar 26, 2022

closing in favor of #1901, which has the leftover bits from here.

@ctb ctb closed this as completed Mar 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
revisit_me An issue that needs attention and clarification
Projects
None yet
Development

No branches or pull requests

2 participants