Definition of identity

Ryan Wick edited this page Jan 4, 2019 · 2 revisions

Badread defines identity the same way as BLAST does: the number of matching bases divided by the length of the alignment. Take this example of a 24 bp read that originated from a 24 bp fragment of DNA. The read has three errors: one deletion, one substitution and one insertion. The deletion and the insertion each add a gap column, so the alignment is 25 positions long, and 22 of those positions match. This read's identity is therefore 22 / 25 = 88%. Note that the denominator is not the length of the read but rather the length of the alignment.

             Read:  ACGAC-CAGCAGTCGCGACTAGCTT
                    ||||| |||||| || |||||||||
Original sequence:  ACGACTCAGCAGACG-GACTAGCTT
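
The calculation for the alignment above can be sketched in a few lines of Python. This is only an illustration of the definition, not Badread's own code: the function name `blast_identity` and its `gap` parameter are hypothetical, and it assumes the two sequences have already been aligned and padded with `-` to the same length.

```python
def blast_identity(aligned_read: str, aligned_ref: str, gap: str = "-") -> float:
    """Matching columns divided by alignment length (gap columns count)."""
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_read:
        raise ValueError("alignment is empty")
    # A column matches only when both rows hold the same base (never a gap).
    matches = sum(
        1 for a, b in zip(aligned_read, aligned_ref) if a == b and a != gap
    )
    return matches / len(aligned_read)


# The two aligned rows from the example above.
read = "ACGAC-CAGCAGTCGCGACTAGCTT"
original = "ACGACTCAGCAGACG-GACTAGCTT"

identity = blast_identity(read, original)
print(f"{identity:.0%}")  # 88%
```

Dividing by the read length instead (22 / 24) would give about 92%, which is why the choice of denominator matters.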

You can read more in Heng Li's excellent blog post: On the definition of sequence identity.

Since DNA has only a 4 letter alphabet, two completely random sequences can typically align with >50% identity. As an example, here are two random sequences aligned to each other which match in 32 places over 59 alignment positions, giving an identity of 54%:

AAT-CGGCGCGTCCCGCGTTTCGGAAATTGA-C-ACTCTGACG-GTT---AGCACAG--
| | ||| | | |  || ||  ||   | || | | | ||| | |||   || | ||  
ATTACGG-GAG-C--GC-TTA-GGC--T-GAACTATTATGATGCGTTGCGAGAAAAGGA

This means that any read with less than about 60% identity is difficult to distinguish from random sequence.
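
As a rough sketch of how this rule of thumb might be applied, the snippet below recomputes the identity of the random alignment above and compares it with a 60% cutoff. The names `alignment_identity`, `looks_random` and `RANDOM_IDENTITY_CUTOFF` are hypothetical, and treating "about 60%" as a hard threshold is an assumption made only for illustration.

```python
# Assumption: the "about 60%" rule of thumb treated as a fixed cutoff.
RANDOM_IDENTITY_CUTOFF = 0.60


def alignment_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Matching columns divided by alignment length (gap columns count)."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    return matches / len(row_a)


def looks_random(identity: float, cutoff: float = RANDOM_IDENTITY_CUTOFF) -> bool:
    """True when the identity is too low to tell apart from random sequence."""
    return identity < cutoff


# The two random sequences from the alignment above.
seq_a = "AAT-CGGCGCGTCCCGCGTTTCGGAAATTGA-C-ACTCTGACG-GTT---AGCACAG--"
seq_b = "ATTACGG-GAG-C--GC-TTA-GGC--T-GAACTATTATGATGCGTTGCGAGAAAAGGA"

identity = alignment_identity(seq_a, seq_b)
print(f"{identity:.0%}", looks_random(identity))  # 54% True
```

By the same test, the 88% read from the first example would not be flagged.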
