Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MRG: add abund estimation to manysearch #302

Merged
merged 10 commits into from
Jul 3, 2024
Merged

MRG: add abund estimation to manysearch #302

merged 10 commits into from
Jul 3, 2024

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented Apr 19, 2024

Adds abundance columns average_abund, median_abund, std_abund to manysearch output.

Changed behavior:

  • Modifies manysearch to only build & write a result if we find an overlap between the query and search sigs. This was to avoid issues with abundance-related columns (average, median, std deviation calculations), but I think generally this:
  • shortens the results file to only relevant results and
  • speeds things up a tad, since we stop doing any unnecessary calculations.

To Do:

  • check vs sourmash/ mgmanysearch to ensure abund estimation is correct

Unfortunately, I can only estimate abundance info from non-rocksdb manysearch. Abund estimation for rocksdb databases would require us to estimate these values w/in sourmash, since we don't have access to which hashes match.

Note: for mgmanysearch type applications, the default threshold, 0.01` is far too stringent. Recommend setting threshold to 0 for these applications.

@bluegenes
Copy link
Contributor Author

Comparing this PR's abund manysearch vs sourmash_plugin_containment_search (mgmanysearch)

Small test dataset:

  • 100,000 query reads
  • 1000 contigs

built and run by @AnneliektH via: https://github.com/AnneliektH/2024-binning-kmer/blob/main/workflow/notebooks/notes_sourmash.ipynb

(both run with single thread)

software/version command details time max RAM
branchwater v0.9.3-dev manysearch with abundance 2s 103 MB
sourmash_plugin containment_search mgmanysearch with abundance 57s 111 MB

manysearch:

Command being timed: "sourmash scripts manysearch ../sourmash_sketches/sketch_reads/ERR1135178_QC.subset.zip ../sourmash_sketches/ERR11351_1000.zip -k 31 --scaled 100 -o ERR1135178.mn.timed.csv -c 1 -t 0"
        User time (seconds): 1.29
        System time (seconds): 0.69
        Percent of CPU this job got: 91%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 103264
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 9
        Minor (reclaiming a frame) page faults: 26718
        Voluntary context switches: 3122
        Involuntary context switches: 2879
        Swaps: 0
        File system inputs: 0
        File system outputs: 64
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

mgmanysearch:

Command being timed: "sourmash scripts mgmanysearch --against ../sourmash_sketches/sketch_reads/ERR1135178_QC.subset.zip --queries ../sourmash_sketches/ERR11351_1000.zip -k 31 --scaled 100 -o ERR1135178.timed.mgm.csv"        
        User time (seconds): 35.95
        System time (seconds): 5.88
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:57.65
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 111128
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 4026
        Minor (reclaiming a frame) page faults: 312082
        Voluntary context switches: 60369
        Involuntary context switches: 605
        Swaps: 0
        File system inputs: 1021744
        File system outputs: 480
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

@ctb
Copy link
Collaborator

ctb commented Apr 23, 2024

wow, that's a fantastic speed improvement 😆

@ctb
Copy link
Collaborator

ctb commented Jun 21, 2024

per @bluegenes -

it's done - I was just waiting on some verification numbers from annie and lost track of it.

I will dig in!

@ctb ctb changed the title WIP: add abund estimation to manysearch MRG: add abund estimation to manysearch Jun 21, 2024
@ctb
Copy link
Collaborator

ctb commented Jun 21, 2024

(I will probably try to do some validation on this along with #363 and #366! I'm on a roll!)

@bluegenes
Copy link
Contributor Author

@ctb We also had a couple other thoughts going through this. I think they were:

  1. Would it be worth adding an --abundance-from param so the abundance could be from either query or search sigs?
  2. could we offer a different threshold too, e.g. minimum number of hashes? For very short query sequences, our default % threshold is far too stringent.

@ctb
Copy link
Collaborator

ctb commented Jun 21, 2024

@ctb We also had a couple other thoughts going through this. I think they were:

  1. Would it be worth adding an --abundance-from param so the abundance could be from either query or search sigs?

Yes! If you get to it before I review and merge, great. If not, I'll create an issue 😉

  1. could we offer a different threshold too, e.g. minimum number of hashes? For very short query sequences, our default % threshold is far too stringent.

Sure! Sensitive default => better here.

@ctb
Copy link
Collaborator

ctb commented Jul 1, 2024

ran https://hackmd.io/tKpLr1ISR9mqHHmZbEvcow?view with both mgmanysearch and this PR.

executed:

ls -1 f_prausnitzii.sig.gz b_uniformis.sig.gz > query.txt
ls -1 CD*.sig.gz > against.txt
sourmash scripts manysearch against.txt query.txt -o search2.csv -t 0

and then loaded both CSVs in and explored.

Screenshot 2024-07-01 at 3 50 33 PM

The numbers look the same, so 🎉 !!

A few interesting differences -

  • for this PR/manysearch, query is metagenomes, against is genomes; for mgmanysearch, the reverse;
  • the fraction of metagenome covered is not available from this PR (b/c of speed, so ✅ )
  • rows with 0s are not reported in this PR/manysearch, which again, fine!

I want to take a gander at the ANI values, but if those are fine (which I expect they will be :glare:) then I will merge, unless there is a last minute objection.

@ctb
Copy link
Collaborator

ctb commented Jul 3, 2024

🎉 they are identical. merging!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants