
Make use of base quality score - Fastq support #71

Closed
lerch-a opened this issue Feb 17, 2016 · 7 comments

@lerch-a

lerch-a commented Feb 17, 2016

Hi,

thanks a lot for this software. The concept is very convincing to me.

I would just like to ask why the base quality scores are not used during the clustering and dereplication process.
Is it not possible because of the algorithm, or would it slow down the clustering process?
What about incorporating the Phred scores after the OTUs are finished, e.g. as a consensus quality score?

Thanks.

@frederic-mahe
Collaborator

Hi @nidlae,

I am preparing a general answer to that question, but the manuscript is not ready yet. In short, swarm is a denoising tool, but not the only one we should apply to our data.

In my opinion, clustering should be as lossless a process as possible (same number of reads in and out). Quality filtering is a means of reducing the size of datasets before clustering to speed up computation. Since swarm is faster than other clustering methods, that ordering of operations is no longer necessary.

Here is what I do for my data:

  • paired-ends assembly with vsearch,
  • demultiplexing and primer clipping with cutadapt,
  • fastq to fasta conversion, dereplication and extraction of quality values (i.e. expected error rates per sequence) with vsearch,
  • dereplication of the whole project (pool all samples) with vsearch,
  • clustering with swarm.
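The expected error rate mentioned in the third step is the sum, over all bases of a read, of the error probabilities implied by the Phred scores (this is the value vsearch can report with its --eeout option). A minimal Python sketch of that computation, assuming Phred+33 encoding (the function name `fastq_ee` is illustrative, not part of vsearch):

```python
def fastq_ee(quality_string, offset=33):
    """Expected number of errors in a read: the sum of the per-base
    error probabilities p = 10^(-Q/10) decoded from Phred scores."""
    return sum(10 ** (-(ord(c) - offset) / 10) for c in quality_string)

# A read whose bases are all Q40 ('I' in Phred+33) has p = 0.0001 per base,
# so a 100-bp read at Q40 has an expected error of 0.01:
ee = fastq_ee("I" * 100)  # 0.01
```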

After that, I only work on OTU representatives:

  • taxonomic assignment with vsearch and custom scripts,
  • chimera detection with vsearch.

I then build an OTU table: OTUs vs samples, plus taxonomic assignment results, OTU dispersion in samples, chimeric status, and quality values. For each OTU representative, I search for the best expected error value divided by the length (a value of 0.0001 is ideal). In layman's terms, I reject an OTU if I don't see, at least once, a copy of its representative sequence with top quality (ee = 0.0001). In practice, I leave a small margin of error and keep OTUs with an ee value < 0.0002.
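The acceptance rule described above can be written as a one-line check. An illustrative sketch (the function and parameter names are mine, not the pipeline's; the 0.0002 cutoff is the margin quoted above):

```python
def keep_otu(best_ee, length, threshold=0.0002):
    """Keep an OTU if the best observed expected-error value of its
    representative, divided by its length, falls below the threshold
    (0.0001 per base would be a perfect-quality read; 0.0002 leaves
    a small margin of error)."""
    return best_ee / length < threshold

keep_otu(best_ee=0.01, length=100)  # 0.0001 per base -> True
keep_otu(best_ee=0.05, length=100)  # 0.0005 per base -> False
```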

As you can see, all filtering is postponed until after the clustering, to retain potential signal as long as possible. Only at the very end are OTUs filtered, using different sources of data (quality, taxonomic assignment or not, chimera or not, presence of the OTU in several samples or not, etc.).

I hope it answers your question.

@colinbrislawn

Thank you for taking the time to discuss your methods. Your perspective is fascinating.

How do you build your OTU table in such a way that it includes all that additional information about your swarm centroids? I know you describe this in the wiki and I would love to hear how you include more info.

Colin

@lerch-a
Author

lerch-a commented Feb 18, 2016

Hi @frederic-mahe

yes, this answers my question beyond what I had hoped for. I'm looking forward to the finished manuscript.
I share your view of a lossless process; filtering of OTUs should be the last step.

I use amplicon sequencing to genotype samples with multi-clone infections, so the final filtering step is crucial for determining the correct multiplicity of infection in a sample.

Thanks a lot.

@tobiasgf

tobiasgf commented Mar 8, 2016

I really like this approach of doing most (all) of the filtering as final steps.
However, I am wondering how to keep the "...chimeric status, and quality values..." all the way to the OTU table. In other words, how can you keep the ee score (or ee/seq_length) of the best-scoring read in an OTU through dereplication and swarm clustering?
Do you get that when you map the "original reads" back against the OTU representatives?
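For what it's worth, one way to carry the best ee value through dereplication is to record, for each unique sequence, the minimum ee observed among its identical copies. A hypothetical Python sketch, not code from the actual pipeline:

```python
def dereplicate_with_best_ee(reads):
    """reads: iterable of (sequence, expected_error) pairs.
    Returns {sequence: (abundance, best_ee)} where best_ee is the
    lowest expected-error value seen among identical copies, so the
    quality information survives dereplication."""
    table = {}
    for seq, ee in reads:
        count, best = table.get(seq, (0, float("inf")))
        table[seq] = (count + 1, min(best, ee))
    return table

reads = [("ACGT", 0.02), ("ACGT", 0.005), ("TTGA", 0.01)]
dereplicate_with_best_ee(reads)
# {"ACGT": (2, 0.005), "TTGA": (1, 0.01)}
```

After clustering, the best ee of each OTU representative can then be looked up in this table.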

Looking forward to more.

@frederic-mahe
Collaborator

I'll try to describe a complete pipeline (from raw fastq to OTU table). I found an ITS1 experiment suitable (i.e. not too complicated), which I will describe on a GitHub wiki page. I'll post the link here when it is ready.

Please be patient, I am completely swamped by other projects.

@frederic-mahe
Collaborator

That page describes my OTU delineation pipeline and my strategy to preserve and use read quality values.

I hope it answers @nidlae's question.

@lerch-a
Author

lerch-a commented May 2, 2016

Thanks a lot @frederic-mahe. Yes, this helps a lot.

On 1 May 2016, at 3:43 AM, Frédéric Mahé notifications@github.com wrote:

Closed #71.


