Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

losing sequences with vsearch --derep_prefix #270

Closed
gregcaporaso opened this issue Sep 25, 2017 · 5 comments
Closed

losing sequences with vsearch --derep_prefix #270

gregcaporaso opened this issue Sep 25, 2017 · 5 comments

Comments

@gregcaporaso
Copy link

I may be misunderstanding what vsearch --derep_prefix does exactly, but it seems to me that I'm losing some sequences when trying to dereplicate with this command.

Here is my input file (seqs.fna):

>sample1_1
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>sample1_2
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>s2_1
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>s2_2
AAACGTTACGGTTAACTATACATGCAGAAGACTA
>s2_42
ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG
>s2_43
ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG

Here is the command and stdout:

$ vsearch --derep_prefix seqs.fna --uc out.uc
vsearch v2.0.3_osx_x86_64, 16.0GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file seqs.fna 100%
243 nt in 6 seqs, min 34, max 46, avg 40
Sorting by length 100%
Dereplicating 100%
Sorting 100%
2 unique sequences, avg cluster 3.0, median 3, max 4
Writing uc file, first part 100%
Writing uc file, second part 100%

And here is my output:

S	0	39	*	*	*	*	*	s2_1	*
S	1	46	*	*	*	*	*	s2_42	*
H	1	46	100.0	+	0	0	*	s2_43	s2_42
C	0	4	*	*	*	*	*	s2_1	*
C	1	2	*	*	*	*	*	s2_42	*

In this case, s2_2 is a prefix of sequences sample1_1, sample1_2, and s2_1, which are all identical. Shouldn't s2_2, sample1_1 and sample1_2 be in out.uc?

Thanks for the help!

gregcaporaso added a commit to gregcaporaso/q2-vsearch that referenced this issue Sep 25, 2017
I either found a bug in this functionality in vsearch, or I don't understand what it's supposed to be doing:

torognes/vsearch#270
@colinbrislawn
Copy link
Contributor

@gregcaporaso I think this is the same bug as #201, which is fixed in newer versions.

When using the newest version on bioconda, here is my output:

S	0	39	*	*	*	*	*	s2_1	*
H	0	34	100.0	+	0	0	*	s2_2	s2_1
H	0	39	100.0	+	0	0	*	sample1_1	s2_1
H	0	39	100.0	+	0	0	*	sample1_2	s2_1
S	1	46	*	*	*	*	*	s2_42	*
H	1	46	100.0	+	0	0	*	s2_43	s2_42
C	0	4	*	*	*	*	*	s2_1	*
C	1	2	*	*	*	*	*	s2_42	*

Not sure why s2_1 was selected as the centroid, but at least this list is complete.

@torognes
Copy link
Owner

torognes commented Sep 26, 2017

Thanks for reporting this bug, @gregcaporaso. @colinbrislawn is right, this was an earlier bug that was fixed in version 2.1.1. It failed to output the H-lines to the UC file when clustering.

@gregcaporaso
Copy link
Author

gregcaporaso commented Sep 26, 2017

Ok, thank you both for the input!

@frederic-mahe
Copy link
Collaborator

@colinbrislawn

Not sure why s2_1 was selected as the centroid, but at least this list is complete.

Identical sequences are sorted by decreasing abundance and label increasing alpha-numerical order. Here there are no abundance value, and the label s2_1 comes before sample1_1 after an alpha-numerical sorting.

@colinbrislawn
Copy link
Contributor

Ah, that's how it works!

I love how vsearch uses rounds of sorting to produce consistent, stable results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants