Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reimplement wrapping of Index classes from command-line for lower memory #1899

Open
ctb opened this issue Mar 26, 2022 · 1 comment
Open

Comments

@ctb
Copy link
Contributor

ctb commented Mar 26, 2022

The MultiIndex class is used for four purposes:

  • loading one or more signatures from JSON via stdin
  • loading one or more signatures from a JSON file
  • loading one or more .sig/.sig.gz files from within a directory via traversal
  • loading many signatures from many files (including zip, LCA, SBTs, and manifests) specified in a pathlist file

In #1425, I realized that MultiIndex always generates an in-memory manifest containing full signatures, even when wrapping other indices via a pathlist file.

In particular, this means that when loaded via a pathlist

  • all signatures from all databases are loaded into memory;
  • SBTs and LCAs are not searched using indexed search;

This is not good behavior, to say the least. Luckily we don't really suggest using pathlists anywhere in the documentation so it's not a pressing matter, but we should fix this sometime.

One way to fix this would be to use the StandaloneManifestIndex class from #1891 to load or generate all manifests of wrapped index objects upon load, and then store only the manifests in memory. This would support lazy loading of all the files, with the only negative being a performance hit from parsing JSON files (and LCA indices) each time we needed the actual signatures. I think this is probably OK as it is in keeping with the trend towards low-memory-is-better and since we have many good on-disk signature storage mechanisms now, it's probably better to point people towards those for storage.

ref #1895 #1096

@ctb ctb changed the title improve wrapping of Index classes from command-line functionality support low-memory wrapping of Index classes from command-line functionality Mar 26, 2022
@ctb ctb changed the title support low-memory wrapping of Index classes from command-line functionality reimplement wrapping of Index classes from command-line for lower memory Apr 16, 2022
@ctb
Copy link
Contributor Author

ctb commented Aug 15, 2022

One way to fix this would be to use the StandaloneManifestIndex class from #1891 to load or generate all manifests of wrapped index objects upon load, and then store only the manifests in memory. This would support lazy loading of all the files, with the only negative being a performance hit from parsing JSON files (and LCA indices) each time we needed the actual signatures. I think this is probably OK as it is in keeping with the trend towards low-memory-is-better and since we have many good on-disk signature storage mechanisms now, it's probably better to point people towards those for storage.

random thought: can we provide "hints" on Index classes that guide enclosing classes?

  • "this class is a memory hog" (e.g. LCA_Database)
  • "this class is on disk and permits fast lazy load" (e.g. ZipFileLinearIndex, SqliteIndex)
  • "this class doesn't support random indexing" (e.g. .sig.gz files)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant