Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

what's required for Rust-based search parallelism (aka greyhound)? #1752

Open
ctb opened this issue Oct 10, 2021 · 10 comments
Open

what's required for Rust-based search parallelism (aka greyhound)? #1752

ctb opened this issue Oct 10, 2021 · 10 comments

Comments

@ctb
Copy link
Contributor

ctb commented Oct 10, 2021

From an e-mail conversation with @luizirber -

me:

Would love to get parallel search in Rust into sourmash. How far away are we?

I think I have a pretty well developed mental model of what is needed conceptually, but no actual ability to connect it to rust.

(I gather from our discussion that Oxidized LinearIndex #1526 is not directly related, because moving signatures around between Python and Rust is a terrible idea.)

I think getting FSStorage and ZipStorage equivalents into Rust might help? But the Storage class in Python could use some work. I might push forward on using more ZipStorage #1598 for that reason.

With the manifest-of-manifests work #1685 we’re getting closer and closer to having what I think is the right Python side thing for greyhound - “here’s a storage, here’s a list of signatures, please go search it.”

@luizirber response:

Let's see:

  • Definitely not too far away. But maybe merge at least the Rust side of the
    greyhound PR first, because PRs on top of PRs are annoying...
  • ZipStorage will not be very useful in Rust unless data can be read in
    parallel. The most common crate zip is very hard to use properly in parallel,
    but there is another crate piz (that only supports reading) which will work.
    I'll push the repo with what I tried out before trying to go for a sourmash
    Storage.
  • Developing ZipStorage is probably best done as a replacement for the Python
    ZipStorage (even if it is not going to replace it in the end, just to reuse
    tests). There are many quirks in writing the Zip file in the Python ZipStorage...
  • PR [WIP] Oxidized LinearIndex #1526 is still useful because Index operations work (even if slow with all
    the copying of data from one side to the other). But not for merging, just as
    reference.
  • The greyhound PR can pick some things from [WIP] Oxidized LinearIndex #1526, but I think it will be hard
    to go further before a ZipStorage (or FSStorage with write support) is
    implemented.
  • Erik used an approach of sending JSON instead of trying
    to define all the types in both sides (in the context of Lineages for LcaDB).
    This is probably useful while prototyping what is needed on each side.
@ctb ctb changed the title what's required for Rust-based search parallelism? what's required for Rust-based search parallelism (aka greyhound)? Oct 10, 2021
@ctb
Copy link
Contributor Author

ctb commented Oct 14, 2021

since @mr-eyes is asking about this in e-mail, I'm going to add some more commentary here - with reference to the code base in #1598, so, the branch at https://github.com/sourmash-bio/sourmash/tree/add/zipfile_use_storage:

the place to start, I think, is with getting read-only FSStorage and/or ZipStorage working in Rust as a new Storage class.

I think ZipStorage might be easiest since IIRC the zip reading code was quite straightforward, and my other hot take is that if you can get it to work for the API used in the first if block in src/sourmash/index.py::ZipFileLinearIndex:signatures(), you will have something useful and are probably 90% of the way to a fully functional ZipStorage.

The relevant code block is below, lightly edited:

            manifest = self.manifest

            # yield all signatures found in manifest                            
            for filename in manifest.locations():
                data = self.storage.load(filename)
                for ss in load_signatures(data):
                    # in case multiple signatures are in the file, check        
                    # to make sure we want to return each one.                  
                    if ss in manifest:
                        yield ss

and it implies the following:

  • you need a 'storage.load(...)` method that loads a specific file from a location inside a zip file;
  • that location can contain multiple signatures;

and that's it.

(The step after that involves making sure that you can load any type of file from a zipfile, not just a JSON signatures file.)

FSStorage is not much more complicated in practice so if you like I can dig into that code for you and get back to you, too.

Both ZipStorage and FSStorage are implemented in sbt_storage.py and I hope you listen to me when I suggest implementing the reading portion of them only, at least for now :). There's a bigger conversation to be had about refactoring the Storage class to be cleaner, but I'd like to avoid that conversation for as long as possible...

@ctb
Copy link
Contributor Author

ctb commented Oct 15, 2021

More conversation intended for @mr-eyes -

Looking at the greyhound PR #1238, I can't tell how far along it is. It probably needs to be adjusted to match / be useful for #1370 but I can't be sure without spending more time trying it out.

I still think ZipStorage is a place to start.

@ctb
Copy link
Contributor Author

ctb commented Oct 20, 2021

@mr-eyes @luizirber I wrote up a little Jupyter Notebook demonstrating how to use manifests and Storage classes together over in #1758. Enjoy!

Here is a direct link to the notebook that will expire if the PR is merged & the branch deleted.

@ctb
Copy link
Contributor Author

ctb commented Oct 20, 2021

err, and just to clarify, the API needed for greyhound is, I think, the read-only storage access - see the load_sigs function. If you can get something like that working on the Rust side, especially with a custom list of locations, then we would be well on our way to supporting greyhound.

@ctb
Copy link
Contributor Author

ctb commented Oct 21, 2021

@mr-eyes I added a demo notebook here that provides a Python implementation of what we discussed yesterday.

The load_sig(...) function is what we need to have working on the Rust side, and the search_list_of_sigs(...) function is the goal for greyhound functionality overall (just, you know, parallelized ;).

@mr-eyes
Copy link
Member

mr-eyes commented Oct 24, 2021

@mr-eyes I added a demo notebook here that provides a Python implementation of what we discussed yesterday.

The load_sig(...) function is what we need to have working on the Rust side, and the search_list_of_sigs(...) function is the goal for greyhound functionality overall (just, you know, parallelized ;).

This is very helpful, thank you! I'm starting to work on it tomorrow.

@mr-eyes
Copy link
Member

mr-eyes commented Oct 25, 2021

I have played around with piz, and it was effortless and straightforward to use with very comfortable docs.

@mr-eyes
Copy link
Member

mr-eyes commented Oct 26, 2021

There's a possibility of having gzipped files in a Zip file, right?

@luizirber
Copy link
Member

There's a possibility of having gzipped files in a Zip file, right?

Yes, and no: sourmash zipfiles use zipfile.STORED, which just puts the data in the file (doesn't try to compress it). Usually when data arrives at the ZipStorage it is already compressed (save_signatures with compression), because Rust is much faster for compression/decompression, and we can support other formats (any that niffler supports).

@ctb
Copy link
Contributor Author

ctb commented Sep 23, 2023

note also: plugin pyo3_branchwater, which implements this stuff: https://github.com/sourmash-bio/pyo3_branchwater

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants