Make use of base quality score - Fastq support #71
Comments
Hi @nidlae, I am preparing a general answer to that question, but the manuscript is not ready yet. In short, swarm is a denoising tool, but not the only one we should apply to our data. In my opinion, clustering should be, as far as possible, a lossless process (same number of reads in and out). Quality filtering is a means of reducing the size of datasets before clustering to speed up computation. Since swarm is faster than other clustering methods, that ordering of operations (filter first, then cluster) is no longer justified. Here is what I do for my data:
After that, I only work on OTU representatives:
I then build an OTU table: OTUs vs samples, plus taxonomic assignment results, OTU dispersion in samples, chimeric status, and quality values. For each OTU representative, I search for the best (lowest) expected error value divided by the sequence length (a value of 0.0001 is ideal). In layman's terms, I reject an OTU if I don't see at least one copy of its representative sequence with top quality (ee = 0.0001). In practice, I leave a small margin of error and keep OTUs with an ee value < 0.0002. As you can see, all filtering is deferred until after the clustering, to retain potential signal as long as possible. OTUs are filtered only at the very end, using different sources of data (quality, taxonomic assignment or not, chimera or not, presence of the OTU in several samples or not, etc.). I hope this answers your question. |
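To make the quality criterion above concrete, here is a minimal Python sketch, not the actual scripts used in the pipeline. It assumes Phred+33 encoded quality strings and that reads are available as (sequence, quality) pairs; the function names and the input layout are hypothetical and chosen only for illustration. The expected error of a read is taken as the sum of its per-base error probabilities, 10^(-Q/10), which is then divided by the read length.

```python
def expected_error_rate(quality, offset=33):
    """Expected number of errors divided by read length.

    Assumes Phred quality characters with the given ASCII offset
    (33 is an assumption here; older files may use 64).
    """
    if not quality:
        raise ValueError("empty quality string")
    # Per-base error probability: p = 10 ** (-Q / 10)
    probabilities = (10 ** (-(ord(char) - offset) / 10) for char in quality)
    return sum(probabilities) / len(quality)


def best_rate_per_sequence(records):
    """For each distinct sequence, keep the lowest ee/length seen among its copies."""
    best = {}
    for sequence, quality in records:
        rate = expected_error_rate(quality)
        if sequence not in best or rate < best[sequence]:
            best[sequence] = rate
    return best


def keep_otus(representatives, best, threshold=0.0002):
    """Keep OTU representatives seen at least once with ee/length below the threshold."""
    return [seq for seq in representatives
            if best.get(seq, float("inf")) < threshold]


# Hypothetical reads: "I" encodes Q40 and "5" encodes Q20 in Phred+33.
reads = [("ACGT", "IIII"), ("ACGT", "5555"), ("TTTT", "5555")]
best = best_rate_per_sequence(reads)
kept = keep_otus(["ACGT", "TTTT"], best)
```

In this toy input, one copy of ACGT has Q40 at every position (ee/length of 0.0001), so that representative passes the 0.0002 threshold, while TTTT is only ever seen at Q20 and is rejected.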
Thank you for taking the time to discuss your methods. Your perspective is fascinating. How do you build your OTU table in such a way that it includes all that additional information about your swarm centroids? I know you describe this in the wiki and I would love to hear how you include more info. Colin |
Yes, this answers my question beyond what I had hoped for. I'm looking forward to the finished manuscript. I use amplicon sequencing to genotype samples with multi-clone infections, so the final filtering step is crucial for determining the correct multiplicity of infection in a sample. Thanks a lot. |
I really like this approach of doing most (all) of the filtering as final steps. Looking forward to more. |
I'll try to describe a complete pipeline (from raw fastq to OTU table). I found a suitable ITS1 experiment (i.e. not too complicated), which I will describe on a GitHub wiki page. I'll post the link here when it is ready. Please be patient, I am completely swamped by other projects. |
That page describes my OTU delineation pipeline and my strategy to preserve and use read quality values. I hope it answers @nidlae's question. |
Thanks a lot @frederic-mahe https://github.com/frederic-mahe. Yes this helps a lot.
|
Hi,
thanks a lot for this software. The concept is very convincing to me.
I would just like to ask why the base quality score is not used during the clustering and dereplication process.
Is it not possible because of the algorithm, or would it slow down the clustering process?
What about incorporating the Phred score once the OTUs are finished, e.g. as a consensus quality score?
Thanks.