advantages of swarm over prefix clustering

Frédéric Mahé edited this page Oct 22, 2015 · 2 revisions

Prefix clustering works as follows:

A = ACGTACGT
B = ACGTACG

In this example, B is a prefix of A. With prefix clustering, A becomes the representative of the cluster and B is subsumed. This assumes that the longest sequence is the correct one. That assumption is dubious: the strongest clue at our disposal for identifying correct sequences appears to be the abundance value. The locally most abundant sequence is likely to be the correct one, and the surrounding less abundant sequences are likely to be errors derived from it. Of course, natural variability exists, and some less abundant sequences can be real too (i.e. present in the wild), but technical noise seems to be the main source of variability.
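To make the behaviour described above concrete, here is a minimal Python sketch of prefix clustering. It is an illustration only, not the code of any particular tool: the function name `prefix_cluster` and the third sequence `TTGA` are made up for the example, and real implementations may order or break ties differently.

```python
def prefix_cluster(sequences):
    """Greedy prefix clustering (illustrative sketch).

    Sequences are visited from longest to shortest. A sequence that is a
    prefix of an existing representative joins that cluster; otherwise it
    starts a new cluster and becomes its representative.
    """
    clusters = {}
    # Longest first, so representatives are always the longest members.
    for seq in sorted(set(sequences), key=lambda s: (-len(s), s)):
        for rep in clusters:
            if rep.startswith(seq):
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]
    return clusters


# A and B from the example, plus an unrelated (hypothetical) sequence.
clusters = prefix_cluster(["ACGTACG", "ACGTACGT", "TTGA"])
```

Note that abundance plays no role here: the representative is chosen on length alone.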

With prefix clustering, sequence A is declared the representative of the cluster even if sequence B is far more abundant, which creates the false notion that sequence A is the canonical sequence. In the same situation, swarm avoids that trap and identifies the locally most abundant sequence as the representative.
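The difference between the two choices of representative can be sketched in a few lines of Python. The abundance values below are invented for illustration, and the snippet only mimics how a representative is picked within one cluster; it does not reproduce swarm's actual clustering algorithm.

```python
# Hypothetical abundances: B (the shorter sequence) is far more abundant than A.
abundance = {
    "ACGTACGT": 2,    # A
    "ACGTACG": 150,   # B, a prefix of A
}
cluster = list(abundance)

# Prefix clustering: the longest sequence becomes the representative.
longest_rep = max(cluster, key=len)

# Abundance-based choice: the most abundant sequence becomes the representative.
abundant_rep = max(cluster, key=abundance.get)

print("length-based representative:   ", longest_rep)
print("abundance-based representative:", abundant_rep)
```

With these made-up counts, the length-based rule picks A although it was observed only twice, while the abundance-based rule picks B.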