Best practice for ONT metagenomics #3070
Hi @ezherman, I have nothing but intuition to offer, I'm afraid! Hopefully some others will step up with better experience!
Right! I would expect that! My question is: do you need most k-mers to be classified, or would you be satisfied with having enough k-mers be classifiable to get accurate results?

There are two, mmm, schools of thought on taxonomy. One is taxonomic binning, where you are trying to assign each sequence to a specific taxonomic group. The other is taxonomic profiling, where you are trying to identify the taxonomic groups present in your entire data set. sourmash has (so far) mostly been used for taxonomic profiling, where it is very accurate for reasons I think we can explain (combinations of k-mers, basically). I would not suggest using sourmash directly for binning (yet), for a few reasons - the main one being that we are only just now getting to the point where it performs well with the lower scaled value needed. (See the branchwater plugin for our rather successful efforts to improve the performance of sourmash at lower scaled values.)

However, if you want to get binning-like results, there's a third option here, and it's one you might consider: sourmash actually does genomic profiling of the composition of a metagenome with `gather`, and you can then map the reads back to the matching genomes. We have a workflow that implements this for Illumina, called genome-grist. Happy to chat more about that here or there.
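As a concrete illustration of the profiling route described above, a minimal sketch might look like the following (not an official recipe; the read, database, and taxonomy filenames are placeholders for whichever prepared collection you actually use):

```bash
# Sketch the ONT reads with abundance tracking, profile against a prepared
# database with gather, then summarize lineages with the tax module.
# "gtdb-k21.zip" and "gtdb.lineages.sqlite" are placeholder filenames.
sourmash sketch dna -p k=21,scaled=1000,abund reads.fastq.gz -o sample.sig.zip
sourmash gather sample.sig.zip gtdb-k21.zip -o sample.gather.csv
sourmash tax metagenome -g sample.gather.csv -t gtdb.lineages.sqlite
```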
(thank you for this summary!)
Agreed, this is a bad idea!
I was going to say k=15 ;). I agree. The only challenge here is you'd need to build your own database at the k-mer size(s) of interest. That has become much faster and easier with the branchwater plugin.
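For example, building a multi-k custom database might look roughly like this, assuming the branchwater plugin's `manysketch` command is what's meant here (check the plugin docs for the exact CSV layout and parameter-string syntax):

```bash
# genomes.csv is a placeholder CSV listing name,genome_filename,protein_filename
# for each reference genome; the k-mer sizes and scaled value are examples only.
sourmash scripts manysketch genomes.csv -p dna,k=15,k=21,scaled=1000 -o custom-db.zip
```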
Yep. It's likely to reduce your computational costs and maybe your sensitivity, but should not have any effect on specificity.
Yep. I would say that the computational requirements are less of an issue now, with the branchwater plugin.
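For instance, a faster gather step via the plugin might look something like the sketch below; this is a hypothetical invocation, so flag names should be checked against `sourmash scripts fastgather --help` for your installed version.

```bash
# fastgather is the branchwater plugin's faster reimplementation of the gather
# step; the query and database filenames are the placeholders from above.
sourmash scripts fastgather sample.sig.zip gtdb-k21.zip \
    -k 21 --scaled 1000 -o sample.fastgather.csv
```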
If you can identify a simple benchmark data set for us to try out (maybe the one from Portik et al. would work - @bluegenes?), we could potentially do some of this for you...
Just a quick follow-up - on re-reading the above, I wanted to make it clear: unknown/non-matching k-mers are not a problem for sourmash. They are just ignored, except in some summary statistics. You might just try running at k=21 (for which databases are available) and see what happens.
Thank you @ctb for your detailed response! Below are some thoughts/comments.

I appreciate your overview of profiling vs mapping vs profiling + alignment. This is a great overview of the options, thank you! At first, I am looking for a taxonomic profile that accurately represents the species in my sample, to the extent that they are represented in my database. Secondly, I do hope to map reads back to the genomic profile reported by sourmash. genome-grist looks great, and I would hope to implement something similar for ONT data. But for now, my focus is on improving the genomic profile. I would be happy to continue the conversation regarding alignment on the genome-grist page.

Thank you for highlighting the branchwater plugin. Having played a bit with it, it already seems much faster.

As for single-abundance k-mers: so far, I have found that removing single-abundance k-mers massively reduces the size of my signature file (e.g. ~11 MB to ~650 KB). But I haven't yet tested this on many samples, nor in any systematic way.
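For reference, one way to do this kind of filtering on an existing signature, assuming it was sketched with abundance tracking (filenames below are placeholders), is:

```bash
# Drop hashes with abundance below 2 (i.e. seen only once) from an existing
# signature, rather than re-trimming the underlying reads.
sourmash sig filter -m 2 sample.sig.zip -o sample.filtered.sig.zip
```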
It would be great to work together on this! Most of my samples are from undefined communities, so they may not be ideal for benchmarking. I'd be very interested to hear if @bluegenes has any suggestions.
Sure, here or there. genome-grist is kind of suffering from neglect at the moment and needs to be updated...
👍 it's quite a game changer in terms of speed!
So, three thoughts here - and first, I feel like we need to apologize for the construction mess while we're "in transition" -
I personally don't like removing low-abundance k-mers, since it reduces sensitivity ;).
:)
Ah, great! I will give this a try.
Makes sense. It would be a shame to miss out on low-abundance organisms, so it probably won't be my approach either.
I've written a short
note: as of sourmash_plugin_branchwater v0.9.5, the results from
Thank you @ctb! I hope to make time soon to implement the latest version of the plugin.
Both Kraken2 and sourmash use a k-mer-based approach. Why doesn't sourmash work well for ONT metagenomics while Kraken2 works fine?
See the Portik paper, in which the precision of Kraken2 is incredibly low on ONT sequencing of the Zymo mock community. In other words, that work suggests Kraken2 has a high false positive rate on ONT sequencing data. I wouldn't say that's fine!
As per the paper, sourmash produces much better results compared to Kraken2?
I am in the process of determining whether sourmash is the right tool for my metagenomic analyses. I am mostly using ONT data at the moment. Similar to previous posts (e.g. #2236), I have found that sourmash cannot classify most of the k-mers.
I was wondering whether there is a "best practice" for using sourmash with ONT data. To aid the discussion, I thought I'd list all the potential processes and parameters that I think may help. Before trying to optimise them all, I thought I'd check whether others have had any success!
The following seem relevant:
- `--threshold-bp`: computational requirements go up with a lower threshold, though I understand this also increases sensitivity (adjust default thresholding for prefetch and gather? #2360 (comment)); see the sketch after this list for an illustrative invocation.

If anyone has had success with these adjustments on ONT data, or is aware of any other modifications, I would really appreciate your input.
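An illustrative invocation of the threshold adjustment mentioned above (filenames are placeholders, and the 10 kbp value is only an example against the 50 kbp default):

```bash
# Lower the reporting threshold from the 50 kbp default to 10 kbp to pick up
# weaker matches, at the cost of longer runtimes and possibly more spurious hits.
sourmash gather sample.sig.zip gtdb-k21.zip --threshold-bp 10000 -o sample.gather.csv
```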