remove_many causes prefetch to hang on large-ish metagenome samples #1552

Closed
bluegenes opened this issue May 25, 2021 · 10 comments · Fixed by #1613

@bluegenes (Contributor)

example command:

sourmash prefetch SRR7186547.abundtrim.sig  databases/pl-genomes.dna-k31-scaled100.sbt.zip \
                  -k 31 --dna -o results/hangtest.sbt-prefetch.csv
== This is sourmash version 4.1.2.dev2+g516dc53b.d20210525. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR7186547... (k=31, DNA)
all sketches will be downsampled to scaled=1000
loading signatures from 'databases/pl-genomes.dna-k31-scaled100.sbt.zip'
total of 21 matching signatures.
saved 21 matches to CSV file 'results/hangtest.sbt-prefetch.csv'

When running via slurm, I noticed a prefetch on this sample wrote CSV results in the first ~1 minute, but then kept running (I killed it at ~1 hour). When I ran it via the command line, the job hung after this output and could not be exited with Ctrl-C.

With a dev install, I confirmed that commenting out remove_many (https://github.com/dib-lab/sourmash/blob/latest/src/sourmash/commands.py#L1169) enables prefetch to finish normally.

Note that prefetch does not hang when used within gather (though I do see now that remove_many is only used when calling prefetch directly from the command line).

Making the sig tiny or the searched database tiny (by downsampling) eliminated this issue, enabling the hash counting notifications to print, e.g.

of 85428 distinct query hashes, 53 were found in matches above threshold.
a total of 85375 query hashes remain unmatched.

For now, the sig and related databases can be found on our farm hpc at /home/ntpierce/prefetch_hangtest.

@luizirber (Member) commented May 25, 2021

Great find! Maybe the question here is "do we want to report this?", or only report it if --save-matching-hashes/--save-unmatched-hashes is set (because then it is needed anyway).

The true fix is making remove_many more efficient, because it is a common use case. At the moment remove_many needs a list of hashes to remove (which is convenient, because we can pass a list or set with the hashes), but many times we are actually pulling the hashes from a MinHash without any modifications, and in this case it is better to pass the full MinHash into Rust and process all the data there (avoiding extracting hashes from the MinHash, then building a list, and so on).
This block needs to be changed:
https://github.com/dib-lab/sourmash/blob/516dc53b65727e8a1379b8dc2971eda158aedecf/src/sourmash/minhash.py#L307-L309
into something that checks if the argument is a MinHash, and if so calls another Rust function, like add_many does:
https://github.com/dib-lab/sourmash/blob/516dc53b65727e8a1379b8dc2971eda158aedecf/src/sourmash/minhash.py#L296-L305
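The type-dispatch described above might look roughly like the following. This is an illustrative pure-Python sketch, not sourmash's actual MinHash or its Rust FFI: the class, its `_hashes` set, and the "fast path" here are stand-ins for the real implementation.

```python
# Illustrative sketch of the proposed dispatch in remove_many.
# The class and the fast path are stand-ins, not sourmash's real API.

class MinHash:
    def __init__(self, hashes=()):
        self._hashes = set(hashes)

    def remove_many(self, hashes):
        if isinstance(hashes, MinHash):
            # Fast path: hand the whole object to the backend in one call
            # (here, plain set difference) instead of extracting hashes,
            # building a list, and removing them one by one in Python.
            self._hashes -= hashes._hashes
        else:
            # Slow path: an arbitrary iterable of hashes.
            for h in hashes:
                self._hashes.discard(h)

a = MinHash([1, 2, 3, 4])
b = MinHash([2, 4])
a.remove_many(b)          # takes the fast path
print(sorted(a._hashes))  # [1, 3]
```

The real fix would call a dedicated Rust-side function (analogous to how `add_many` special-cases MinHash arguments), keeping all the per-hash work out of Python.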

related to #1517 (in the sense of "we have issues with gigantic signatures" =])

@ctb (Contributor) commented May 25, 2021

I'd be happy to change prefetch to either not report this, OR to use python sets underneath for the moment. I think the latter preserves nice functionality without triggering this particular problem, and doesn't cause new problems.
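The set-based accounting suggested here could be sketched as follows (made-up hash values, not prefetch's actual code or variable names):

```python
# Counting matched vs. unmatched query hashes with plain Python sets.
query_hashes = {11, 22, 33, 44, 55}   # hashes in the query sketch
found = {22, 55}                      # hashes found in matches above threshold

# Set difference runs as a single pass in C, avoiding per-hash Python calls.
unmatched = query_hashes - found

print(f"of {len(query_hashes)} distinct query hashes, "
      f"{len(found)} were found in matches above threshold.")
print(f"a total of {len(unmatched)} query hashes remain unmatched.")
```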

@mr-eyes (Member) commented Jun 7, 2021

@bluegenes I think you can try again after merging #1571 - I'm curious to see the effect 😄

@bluegenes (Contributor, Author)

Thanks for your efforts, @mr-eyes!!

Unfortunately, it's still quite long -- less than a minute for the prefetch, uh .. >9 hours for counting?

time sourmash prefetch SRR7186547.abundtrim.sig  databases/pl-genomes.dna-k31-scaled100.sbt.zip -k 31 --dna -o results/hangtest.sbt-prefetch.csv

== This is sourmash version 4.1.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR7186547... (k=31, DNA)
all sketches will be downsampled to scaled=1000
loading signatures from 'databases/pl-genomes.dna-k31-scaled100.sbt.zip'
total of 21 matching signatures.
saved 21 matches to CSV file 'results/hangtest.sbt-prefetch.csv'
of 8577001 distinct query hashes, 4775 were found in matches above threshold.
a total of 8572226 query hashes remain unmatched.

real    589m14.614s
user    584m35.585s
sys     3m48.278s

This is likely only a huge issue when using a very large query sig with quite a small search database, but this seems like a use case we want to support...
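The timings above (seconds for the prefetch, hours for the removal accounting) are consistent with per-hash removal cost growing with sketch size. A generic micro-benchmark, unrelated to sourmash internals, shows the gap between scanning a list for each membership test and a single set difference:

```python
import time

def remove_many_list(keep, remove):
    # O(len(keep) * len(remove)): each 'not in' scans the 'remove' list.
    return [h for h in keep if h not in remove]

n = 5_000
keep = list(range(n))
remove = list(range(0, n, 2))  # every other hash

t0 = time.perf_counter()
slow = remove_many_list(keep, remove)
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
fast = set(keep) - set(remove)  # one C-level pass
t_set = time.perf_counter() - t0

print(f"list scan: {t_list:.3f}s   set difference: {t_set:.4f}s")
```

At 8.5M query hashes the quadratic list-style behavior is exactly the kind of thing that turns minutes into hours.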

@bluegenes (Contributor, Author)

@ctb could we have prefetch not report (or just optionally report)?

@ctb (Contributor) commented Jun 16, 2021

I'd rather fix the problem, which we know how to fix :). Just needs a little TLC.

@bluegenes (Contributor, Author)

> I'd rather fix the problem, which we know how to fix :). Just needs a little TLC.

Ah, I didn't realize the fix above wasn't the whole picture.

@ctb (Contributor) commented Jun 16, 2021

Ahh, that's because I didn't put it in this issue... let me see if it's in the issues anywhere...

OK, it's this one, over in genome-grist: dib-lab/genome-grist#32

We should remove the unknown hashes as soon as we can, after the prefetch and before the gather. The challenge is doing the right accounting in the gather_databases() call; it was rather hairy. I'll put it on my list 😛 .
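The "remove unknown hashes after prefetch, before gather" idea could be sketched like this (hypothetical toy values, not genome-grist or sourmash code; the hard part noted above, keeping `gather_databases()` accounting consistent, is not shown):

```python
# Hypothetical sketch: restrict the query to hashes that appeared in at
# least one prefetch match, so the later gather step never carries the
# unknown hashes around.
query = {1, 2, 3, 4, 5, 6}
prefetch_matches = [{2, 3, 9}, {5, 9}]  # hash sets of the matched sketches

known = set().union(*prefetch_matches)  # every hash seen in any match
trimmed_query = query & known           # drop unknown hashes before gather
print(sorted(trimmed_query))            # [2, 3, 5]
```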

@ctb (Contributor) commented Jun 20, 2021

> @bluegenes I think you can try again after merging #1571 - I'm curious to see the effect 😄

Note to all involved (myself included) - we actually needed to change the code in search.py that calls remove_many if we wanted to see an effect... 😆

@ctb (Contributor) commented Jun 20, 2021

ok, so #1613 seems to fix the problem in my hands, but it doesn't fix the underlying problem that for this particular metagenome, remove_many is really slow. We should create a new issue for that, once #1613 is merged.
