Parallelization of sourmash search #2071

Open
mr-eyes opened this issue Jun 1, 2022 · 2 comments

Comments

@mr-eyes
Member

mr-eyes commented Jun 1, 2022

I am running an experiment that requires searching thousands of signatures (wort), and I was wondering whether a parallel version of sourmash search could be implemented to speed up the process. Integrating the multi-processing Rust code would be great, or alternatively adding parallelization to the current Python code.

Relevant: #2069 #2066
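
For readers wondering what "adding parallelization to the current Python code" could look like, here is a minimal sketch, not sourmash's actual implementation. It assumes each signature has already been reduced to a plain set of hash values, and the names `containment`, `search_one`, and `search_many` are hypothetical helpers invented for this illustration. Each query is scored against every database entry in a separate worker process:

```python
from concurrent.futures import Executor, ProcessPoolExecutor
from functools import partial


def containment(query, subject):
    """Fraction of the query's hashes that also occur in the subject."""
    return len(query & subject) / len(query) if query else 0.0


def search_one(item, database, threshold):
    """Score one (name, hashes) query against every database entry."""
    name, hashes = item
    hits = [
        (db_name, containment(hashes, db_hashes))
        for db_name, db_hashes in database.items()
    ]
    # Keep only hits at or above the threshold, best match first.
    hits = [(db_name, score) for db_name, score in hits if score >= threshold]
    return name, sorted(hits, key=lambda hit: -hit[1])


def search_many(queries, database, threshold, executor: Executor):
    """Distribute the queries over the workers of the given executor."""
    worker = partial(search_one, database=database, threshold=threshold)
    return dict(executor.map(worker, queries.items()))


def main():
    # Toy data standing in for real signatures (assumption: plain hash sets).
    queries = {"q1": frozenset({1, 2, 3, 4}), "q2": frozenset({10, 11})}
    database = {"ref_a": frozenset({1, 2, 3, 9}), "ref_b": frozenset({10, 11, 12})}
    with ProcessPoolExecutor(max_workers=2) as pool:
        for name, hits in search_many(queries, database, 0.5, pool).items():
            print(name, hits)


if __name__ == "__main__":
    main()
```

In this sketch the whole database is pickled and sent along with every task, which is fine for toy data but would probably be the bottleneck for large collections; loading the database once per worker would be the natural next step.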

@ctb
Contributor

ctb commented Jun 1, 2022

Hi @mr-eyes, parallel search is not yet implemented in the Rust code included in the main sourmash codebase; see greyhound for that: #1752. Greyhound might not fit this use case anyway, since some of the individual metagenomes are quite large and the greyhound technique involves loading several of them into memory at once. Not sure.

@luizirber implemented a different approach in MAGsearch; see http://ivory.idyll.org/blog/2021-MAGsearch.html and https://blog.luizirber.org/2020/07/24/mag-results/ for background. The sra_search code loads many query genomes/metagenomes into memory and then does a parallel search against hundreds of thousands of signatures.

Note that it only performs containment analyses, and not Jaccard similarity.
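
To illustrate why the distinction matters, here is a small sketch using plain Python sets rather than sourmash objects (the function names and toy data are assumptions made for this example). Containment asks what fraction of the query is found in the subject, while Jaccard similarity divides by the union, so a small genome fully present in a large metagenome scores high on containment but low on Jaccard:

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, subject):
    """Containment: fraction of the query's elements found in the subject."""
    return len(query & subject) / len(query) if query else 0.0


# Toy hash sets: a small "genome" fully contained in a larger "metagenome".
genome = {1, 2, 3, 4}
metagenome = set(range(1, 101))

print(containment(genome, metagenome))  # 1.0
print(jaccard(genome, metagenome))      # 0.04
```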

Somewhere in there, either Luiz or @bluegenes put together a snakemake setup that is working quite well for me, at least. My copy is on farm at ~ctbrown/scratch/magsearch. I think I use the command

snakemake -s magsearch.snakefile --configfile config-seaphage.yml -j 48

to run it.

Note that it's extremely disk-intensive, so we try to avoid running it with more than 48 threads, or running more than one instance at a time, on the cluster.

@ctb
Contributor

ctb commented Sep 4, 2023

https://github.com/sourmash-bio/pyo3_branchwater now covers some of this - see manysearch and multisearch.
