losing sequences with `vsearch --derep_prefix` #270

gregcaporaso · 2017-09-25T21:04:27Z

I may be misunderstanding what vsearch --derep_prefix does exactly, but it seems to me that I'm losing some sequences when trying to dereplicate with this command.

Here is my input file (seqs.fna):

>sample1_1
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>sample1_2
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>s2_1
AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG
>s2_2
AAACGTTACGGTTAACTATACATGCAGAAGACTA
>s2_42
ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG
>s2_43
ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG

Here is the command and stdout:

$ vsearch --derep_prefix seqs.fna --uc out.uc
vsearch v2.0.3_osx_x86_64, 16.0GB RAM, 4 cores
https://github.com/torognes/vsearch

Reading file seqs.fna 100%
243 nt in 6 seqs, min 34, max 46, avg 40
Sorting by length 100%
Dereplicating 100%
Sorting 100%
2 unique sequences, avg cluster 3.0, median 3, max 4
Writing uc file, first part 100%
Writing uc file, second part 100%

And here is my output:

S	0	39	*	*	*	*	*	s2_1	*
S	1	46	*	*	*	*	*	s2_42	*
H	1	46	100.0	+	0	0	*	s2_43	s2_42
C	0	4	*	*	*	*	*	s2_1	*
C	1	2	*	*	*	*	*	s2_42	*

In this case, s2_2 is a prefix of sequences sample1_1, sample1_2, and s2_1, which are all identical. Shouldn't s2_2, sample1_1 and sample1_2 be in out.uc?
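To make that expectation concrete, here is a naive sketch of prefix dereplication in Python. It is not how vsearch implements `--derep_prefix`. The function name `naive_derep_prefix` is hypothetical, and attaching a sequence to the first matching longer sequence is a simplifying assumption. It only shows which labels should end up grouped together for the input above.

```python
def naive_derep_prefix(records):
    """Group sequences that are identical to, or a prefix of, a longer sequence.

    records: list of (label, sequence) tuples.
    Returns a dict mapping each representative sequence to its member labels.
    Illustrative only; vsearch's own tie-breaking rules are not reproduced here.
    """
    clusters = {}
    # Visit longest sequences first, so that any sequence able to contain a
    # shorter one is already a representative when the shorter one arrives.
    for label, seq in sorted(records, key=lambda r: -len(r[1])):
        for representative in clusters:
            if representative.startswith(seq):
                clusters[representative].append(label)
                break
        else:
            # No longer (or identical) sequence starts with this one.
            clusters[seq] = [label]
    return clusters


records = [
    ("sample1_1", "AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG"),
    ("sample1_2", "AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG"),
    ("s2_1", "AAACGTTACGGTTAACTATACATGCAGAAGACTAATCGG"),
    ("s2_2", "AAACGTTACGGTTAACTATACATGCAGAAGACTA"),
    ("s2_42", "ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG"),
    ("s2_43", "ACGTACGTACGTACGTACGTACGTACGTACGTGCATGGTGCGACCG"),
]

clusters = naive_derep_prefix(records)
for representative, labels in clusters.items():
    print(len(representative), sorted(labels))
```

Under this sketch all six labels are assigned to one of two groups, of sizes 4 and 2, which matches the cluster sizes on the C lines of the output above.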

Thanks for the help!

colinbrislawn · 2017-09-26T04:47:09Z

@gregcaporaso I think this is the same bug as #201, which is fixed in newer versions.

When using the newest version on bioconda, here is my output:

S	0	39	*	*	*	*	*	s2_1	*
H	0	34	100.0	+	0	0	*	s2_2	s2_1
H	0	39	100.0	+	0	0	*	sample1_1	s2_1
H	0	39	100.0	+	0	0	*	sample1_2	s2_1
S	1	46	*	*	*	*	*	s2_42	*
H	1	46	100.0	+	0	0	*	s2_43	s2_42
C	0	4	*	*	*	*	*	s2_1	*
C	1	2	*	*	*	*	*	s2_42	*

Not sure why s2_1 was selected as the centroid, but at least this list is complete.
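One way to check that a UC file is complete is to compare the labels on its S and H records against the input labels. The sketch below does this for the two outputs shown in this thread. The helper names `to_uc` and `labels_in_uc` are hypothetical, and it assumes the usual 10-column tab-separated UC layout with the query label in column 9.

```python
def to_uc(text):
    """Convenience for this example only: turn space-separated rows into
    tab-separated UC lines."""
    return "\n".join("\t".join(line.split()) for line in text.strip().splitlines())


def labels_in_uc(uc_text):
    """Return the query labels found on S (centroid) and H (hit) records."""
    labels = set()
    for line in uc_text.splitlines():
        fields = line.split("\t")
        if len(fields) == 10 and fields[0] in ("S", "H"):
            labels.add(fields[8])  # column 9 holds the query label
    return labels


input_labels = {"sample1_1", "sample1_2", "s2_1", "s2_2", "s2_42", "s2_43"}

# Output reported with vsearch v2.0.3 in the original post.
old_uc = to_uc("""
S 0 39 * * * * * s2_1 *
S 1 46 * * * * * s2_42 *
H 1 46 100.0 + 0 0 * s2_43 s2_42
C 0 4 * * * * * s2_1 *
C 1 2 * * * * * s2_42 *
""")

# Output reported with the newer bioconda build.
new_uc = to_uc("""
S 0 39 * * * * * s2_1 *
H 0 34 100.0 + 0 0 * s2_2 s2_1
H 0 39 100.0 + 0 0 * sample1_1 s2_1
H 0 39 100.0 + 0 0 * sample1_2 s2_1
S 1 46 * * * * * s2_42 *
H 1 46 100.0 + 0 0 * s2_43 s2_42
C 0 4 * * * * * s2_1 *
C 1 2 * * * * * s2_42 *
""")

missing_old = input_labels - labels_in_uc(old_uc)
missing_new = input_labels - labels_in_uc(new_uc)
print("missing from old output:", sorted(missing_old))
print("missing from new output:", sorted(missing_new))
```

For the old output this reports s2_2, sample1_1 and sample1_2 as missing; for the new output nothing is missing.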

torognes · 2017-09-26T12:41:10Z

Thanks for reporting this bug, @gregcaporaso. @colinbrislawn is right, this was an earlier bug that was fixed in version 2.1.1. It failed to output the H-lines to the UC file when clustering.

gregcaporaso · 2017-09-26T14:05:33Z

Ok, thank you both for the input!

frederic-mahe · 2017-09-27T16:44:14Z

@colinbrislawn

> Not sure why s2_1 was selected as the centroid, but at least this list is complete.

Identical sequences are sorted by decreasing abundance, then by label in increasing alphanumerical order. Here there are no abundance values, and the label s2_1 comes before sample1_1 in alphanumerical order (the digit 2 sorts before the letter a).
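The ordering rule can be illustrated with a short Python sketch. It assumes that labels are compared by plain character code and that each sequence has an abundance of 1 when no abundance annotation is present. Both are assumptions for illustration, not a description of vsearch internals.

```python
# (label, abundance) for the three identical 39-nt sequences in the example.
# The abundance of 1 is an assumption, since the input has no abundance values.
identical = [("sample1_1", 1), ("sample1_2", 1), ("s2_1", 1)]

# Sort by decreasing abundance first, then by label in increasing order.
ordered = sorted(identical, key=lambda rec: (-rec[1], rec[0]))

centroid = ordered[0][0]
print([label for label, _ in ordered])
print("centroid:", centroid)
```

With equal abundances only the label decides, so s2_1 ends up first and is reported as the centroid.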

colinbrislawn · 2017-09-27T16:52:57Z

Ah, that's how it works!

I love how vsearch uses rounds of sorting to produce consistent, stable results.

gregcaporaso added a commit to gregcaporaso/q2-vsearch that referenced this issue Sep 25, 2017

removed --derep_prefix option

3e102d6

I either found a bug in this functionality in vsearch, or I don't understand what it's supposed to be doing: torognes/vsearch#270

gregcaporaso closed this as completed Sep 26, 2017

gregcaporaso mentioned this issue Sep 26, 2017

add support for --derep_prefix when we support vsearch 2.1.1 qiime2/q2-vsearch#24

Closed
