how do we tell if an Index supports lazy/on-demand loading of signatures? #1895

Closed
ctb opened this issue Mar 25, 2022 · 6 comments

Comments

ctb commented Mar 25, 2022

motivated by #1891, there's an interesting distinction for Index classes that cannot currently be determined programmatically: does this Index class support efficient lazy or on-demand loading of signatures from disk?

For example, zip files, SBTs, and manifest files support this, while signature files, LCA databases, and directories do not.

It would be nice to be able to figure this out programmatically.

Right now, the closest proxy is the manifest attribute, which mostly appears on classes that support lazy loading. But this is confounded by the MultiIndex class, which has a manifest yet loads all the signatures into memory.

So maybe we need a new attribute on Index for this.

this would be a more useful distinction than is_database 🤔
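
A minimal sketch of what such an attribute might look like. The classes below are stand-ins, not the real sourmash Index hierarchy, and the attribute name `supports_lazy_loading` is a hypothetical choice made for illustration.

```python
class Index:
    # Hypothetical class-level flag; the conservative default is "not lazy".
    supports_lazy_loading = False


class FakeZipIndex(Index):
    # Stand-in for an on-disk collection that can load one signature at a time.
    supports_lazy_loading = True


class FakeInMemoryIndex(Index):
    # Stand-in for a collection that must be read in full (e.g. a signature file).
    pass


def describe(idx):
    """Report whether an index advertises on-demand loading."""
    kind = "lazy" if idx.supports_lazy_loading else "eager"
    return f"{type(idx).__name__}: {kind}"
```

Callers could then branch on the flag instead of inferring behavior from the presence of a manifest.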

ctb commented Mar 25, 2022

interesting things to think about -

  • really, we're talking about low memory signature access here. For example, consider the SRA search case - are we loading only the signatures we're currently using, or do we load more?
  • what do we do about containers that can contain other containers, like MultiIndex (for pathlists) and manifests? do we figure it out recursively and set the attribute to True if and only if all of the containers beneath have the attribute?
  • can we provide an optional cmd line flag to require that things be lazy-loadable so that it's exposed to the user and we can help channel user intent / fail when requested?

ref #1425 and #1096
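
The recursive rule and the command-line flag from the list above could be sketched roughly as follows. Everything here is an assumption for illustration: `Leaf`, `Container`, `check_lazy`, and the `--require-lazy` flag are hypothetical names, not existing sourmash classes or options.

```python
import argparse


class Leaf:
    """Stand-in for a single collection (zip file, signature file, ...)."""

    def __init__(self, name, lazy):
        self.name = name
        self.lazy = lazy

    def supports_lazy_loading(self):
        return self.lazy


class Container:
    """Stand-in for a container of collections (pathlist, manifest, ...)."""

    def __init__(self, children):
        self.children = list(children)

    def supports_lazy_loading(self):
        # True if and only if every child (recursively) is lazy-loadable.
        # Note that all() over an empty container is True.
        return all(c.supports_lazy_loading() for c in self.children)


def check_lazy(idx, require_lazy):
    """Fail early if the user required lazy loading and idx cannot provide it."""
    if require_lazy and not idx.supports_lazy_loading():
        raise ValueError("collection does not support lazy loading")
    return idx


def parse_args(argv):
    p = argparse.ArgumentParser()
    # Hypothetical flag name, chosen only for this sketch.
    p.add_argument("--require-lazy", action="store_true")
    return p.parse_args(argv)
```

One open design question this leaves untouched is whether an empty container should count as lazy-loadable; the sketch simply inherits the behavior of `all()`.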

ctb commented Mar 25, 2022

there is an interesting test that highlights some of the challenges here: test_lazy_loaded_index_1.

LCA Databases are potentially very large in-memory databases.

In #1837, we added manifest generating functionality to them by defining the _signatures_with_internal method on LCA_Database.

But we still do not provide a manifest attribute for them because it's not possible to direct-load signatures quickly - you have to load the entire LCA database, then generate signatures.

In #1891, we will add the ability to save a manifest for an LCA database to a separate file, which will permit fast selection, but not lazy loading, of signatures from an LCA database.

🤯 whee!

ctb commented Mar 26, 2022

Note also for the record that MultiIndex, which loads from files, directories and pathlists, is a terrible class for this, because it stores all the signatures in memory - even if they're in a zip file or another on-disk collection that supports lazy loading. see #1899.

A better way of doing this would be to support single-pass creation/loading of manifests and then use those manifests to do lazy loading from zipfiles etc.
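
To make the idea concrete, here is a rough sketch using only the standard library `zipfile` module rather than sourmash itself. The function names are hypothetical, and the manifest here is reduced to a list of dicts with an `internal_location` key; a real manifest carries much more metadata.

```python
import io
import zipfile


def build_manifest(zip_bytes):
    """Single pass over the archive: record member names and sizes only."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return [
            {"internal_location": info.filename, "size": info.file_size}
            for info in zf.infolist()
        ]


def load_lazily(zip_bytes, rows):
    """Yield (location, contents) one member at a time, only for chosen rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for row in rows:
            yield row["internal_location"], zf.read(row["internal_location"])


# Build a small in-memory archive to stand in for a zip file of signatures.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.sig", "signature A")
    zf.writestr("b.sig", "signature B")
archive = buf.getvalue()

# Select on the manifest first, then read only the selected members.
manifest = build_manifest(archive)
picked = [row for row in manifest if row["internal_location"] == "b.sig"]
loaded = list(load_lazily(archive, picked))
```

Selection happens entirely on the manifest, so only the members that survive selection are ever read from the archive.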

ctb commented Sep 5, 2022

💭 can we/should we directly support a LazySourmashSignature class that is just a manifest row, maybe with a link to storage so that we can interpret internal_location properly and use Index.get(...) or Storage.get(...)?
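
A rough sketch of what that could look like, under stated assumptions: `LazySignature` and `DictStorage` are hypothetical stand-ins, the manifest row is a plain dict, and the storage object only needs a `get(location)` method. None of this is the actual sourmash API.

```python
from dataclasses import dataclass, field


class DictStorage:
    """Hypothetical storage backend; counts reads to show when loading happens."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)
        self.reads = 0

    def get(self, location):
        self.reads += 1
        return self.blobs[location]


@dataclass
class LazySignature:
    """A manifest row plus a link to storage; data is fetched on first use."""

    row: dict
    storage: DictStorage
    _data: object = field(default=None, repr=False)

    @property
    def name(self):
        # Metadata comes from the manifest row; no storage access is needed.
        return self.row["name"]

    def load(self):
        # Fetch from storage once, then reuse the cached payload.
        if self._data is None:
            self._data = self.storage.get(self.row["internal_location"])
        return self._data


storage = DictStorage({"sigs/a.sig": b"payload"})
sig = LazySignature({"name": "sample A", "internal_location": "sigs/a.sig"}, storage)
```

Reading `sig.name` touches only the manifest row, while `sig.load()` goes to storage the first time it is called and reuses the cached payload afterwards.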

ctb commented Mar 5, 2024

I think that the discussion around standalone manifests (#3023 and related) renders this moot: between zip files and standalone manifests, we have the right functionality implemented to support this. The things that don't support this - LCA JSON files, directories, and pathlists - are slowly being deprioritized in the documentation (see #3027). Finally, if we switch to outputting .sig.zip files by default per #3068, then we will fully support lazy loading on the default output formats.

ctb commented Mar 5, 2024

(I think I'm going to close this as an unwelcome distraction ;)

ctb closed this as completed Mar 5, 2024