
Clustering

Frédéric Mahé edited this page Jul 7, 2015 · 3 revisions

cluster_smallmem and cluster_fast: what's the difference?

There is a small difference between them: --cluster_fast sorts your sequences by length, longest first. If two sequences are equally long, they are sorted alphabetically by sequence label (the identifier in the FASTA header) as a second key. --cluster_smallmem does not sort the sequences, but it checks that they are already in length-sorted order, unless you specify the --usersort option, in which case no check is made. This behaviour is similar to usearch, as we try to make vsearch behave like usearch in most cases.

In short, --cluster_fast sorts the sequences by decreasing length, no matter what, while --cluster_smallmem expects the sequences to be sorted by decreasing length, or according to another criterion if --usersort is used.
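The ordering described above can be sketched in a few lines of Python. This is only an illustration of the sort keys and of the length-order check, not vsearch's actual code; the function names and the toy records are made up for the example, and "alphabetical" is assumed here to mean plain string comparison of the labels.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (label, sequence); a hypothetical in-memory representation


def sort_like_cluster_fast(records: List[Record]) -> List[Record]:
    """Sort by decreasing sequence length, ties broken by label (ascending)."""
    return sorted(records, key=lambda r: (-len(r[1]), r[0]))


def is_length_sorted(records: List[Record]) -> bool:
    """Return True if no sequence is longer than the one before it."""
    lengths = [len(seq) for _, seq in records]
    return all(a >= b for a, b in zip(lengths, lengths[1:]))


# Toy input: s2 and s3 have the same length, so the label decides their order.
records = [("s3", "ACGT"), ("s1", "ACGTACGT"), ("s2", "ACGA"), ("s4", "ACGTAC")]
sorted_records = sort_like_cluster_fast(records)

print([label for label, _ in sorted_records])   # ['s1', 's4', 's2', 's3']
print(is_length_sorted(records))                # False
print(is_length_sorted(sorted_records))         # True
```

In this picture, --cluster_fast corresponds to always calling the sort, while --cluster_smallmem corresponds to running only the check and refusing unsorted input, unless --usersort disables the check.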

The names "fast" and "smallmem" are a bit misleading, as both commands usually run at about the same speed and use about the same amount of memory. Clustering results may differ if the sequences are processed in a different order, as the order affects which sequences are used as centroids.
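To see why input order matters, here is a toy greedy clustering sketch. It is not vsearch's algorithm: the identity function (fraction of matching positions between equal-length sequences) and the rule "join the first centroid that passes the threshold" are simplifying assumptions chosen only to show the order dependence.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions (equal-length sequences only)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Process sequences in the given order; each one joins the first centroid
    it matches at or above the threshold, otherwise it becomes a new centroid."""
    clusters: Dict[str, List[str]] = {}  # centroid -> members
    for seq in seqs:
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No existing centroid was close enough.
            clusters[seq] = [seq]
    return clusters


# Same three sequences, two different input orders.
print(greedy_cluster(["AAAA", "AAAT", "AATT"], 0.75))
# {'AAAA': ['AAAA', 'AAAT'], 'AATT': ['AATT']}
print(greedy_cluster(["AAAT", "AAAA", "AATT"], 0.75))
# {'AAAT': ['AAAT', 'AAAA', 'AATT']}
```

With "AAAA" first, "AATT" is too far from the centroid and starts its own cluster. With "AAAT" first, all three sequences fall within the threshold of the centroid and end up in a single cluster.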

Why is my consensus sequence empty?

When using the options --msaout and --consout, a consensus sequence can be empty (header, but no nucleotides) under certain conditions. This is a consequence of the definition of the consensus: the consensus symbol in a column is a gap symbol (and is removed) if at least half of the sequences contain a gap in that column. Otherwise, the consensus symbol is the most common symbol among A, C, G and T in that column. If two symbols are equally common, the first of those symbols in alphabetical order is chosen.

That definition can trigger an issue when aligning a long centroid sequence and several shorter sequences that rarely overlap each other. The most common symbol will then be a gap symbol all over the alignment. This leads to a consensus containing only gap symbols, which will subsequently be removed, resulting in an empty sequence.

LONGSEQUENCE
LON---------
------QUE---
---GSE------
---------NCE

------------ empty consensus

Note that consensus sequences are written by both the --consout and --msaout options. Gaps are removed from the consensus sequences written by --consout, but remain in the consensus sequences written by --msaout.
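The consensus rule can be written down as a short Python sketch. It follows the definition given above (gap if at least half of the rows have a gap, otherwise the most common of A, C, G, T, with ties going to the first symbol in alphabetical order). It is an illustration, not vsearch's implementation: the function name is made up, and the input is assumed to contain only A, C, G, T and "-" in rows of equal length.

```python
from collections import Counter
from typing import List, Tuple


def consensus(alignment: List[str]) -> Tuple[str, str]:
    """Return (gapped, ungapped) consensus for equal-length aligned rows."""
    n = len(alignment)
    symbols = []
    for column in zip(*alignment):
        # Gap wins if at least half of the sequences have a gap here.
        if 2 * column.count("-") >= n:
            symbols.append("-")
            continue
        counts = Counter(c for c in column if c in "ACGT")
        # max() keeps the first maximum, and "ACGT" is in alphabetical order,
        # so ties are resolved alphabetically.
        symbols.append(max("ACGT", key=lambda s: counts[s]))
    gapped = "".join(symbols)
    return gapped, gapped.replace("-", "")


# A long centroid and four short, non-overlapping sequences:
# every column has 3 gaps out of 5 rows, so the consensus is all gaps.
sparse = [
    "ACGTACGTACGT",
    "ACG---------",
    "------GTA---",
    "---TAC------",
    "---------CGT",
]
print(consensus(sparse))                      # ('------------', '')

# A denser alignment keeps most columns.
print(consensus(["ACGT", "AC-A", "A--T"]))    # ('AC-T', 'ACT')
```

The first element of each result keeps the gaps, as in the consensus written by --msaout; the second has the gaps removed, as in the consensus written by --consout, which is where the empty sequence shows up.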