## Estimating genome size

This page describes a method of estimating genome size from k-mers and sequencing read information. The equations are based on a review by [Sohn and Nam](https://academic.oup.com/bib/article-lookup/doi/10.1093/bib/bbw096)<sup>1</sup>.

#### Variable definitions

Let:

$\small G$ = genome size in bases

$\small D$ = sequencing depth (or coverage)

$\small D'$ = average k-mer depth derived from the peak of a histogram of k-mer counts (see figure below)

$\small N_{base}$ = number of bases sequenced

$\small N_{reads}$ = number of reads sequenced

$\small k$ = k-mer length

$\small l$ = read length

<img align="left" style="padding-right:10px;" src="img/kmer_histogram.png">
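As a rough illustration of how $D'$ might be read off a k-mer count histogram, the sketch below picks the most common k-mer count after skipping very low counts. The function name `peak_kmer_depth`, the `min_count` cutoff, and the toy histogram are all hypothetical and are not part of the method described on this page. The cutoff reflects the common assumption that k-mers seen only once or twice are mostly sequencing errors.

```python
def peak_kmer_depth(histogram, min_count=3):
    """Return the k-mer count with the most distinct k-mers.

    `histogram` maps a k-mer count to the number of distinct k-mers
    observed that many times. Counts below `min_count` are ignored
    (an assumption: they are treated as sequencing-error k-mers).
    """
    candidates = {c: n for c, n in histogram.items() if c >= min_count}
    if not candidates:
        raise ValueError("no histogram bins at or above min_count")
    # The peak is the count whose bin holds the most distinct k-mers.
    return max(candidates, key=candidates.get)


# Toy histogram, invented for illustration only.
toy_histogram = {1: 9000, 2: 1200, 3: 150, 10: 400,
                 20: 2500, 21: 2700, 22: 2600, 30: 300}
print(peak_kmer_depth(toy_histogram))  # 21
```

Real histograms can have more than one peak (for example from heterozygosity or repeats), so a simple maximum like this may not always be appropriate.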

#### Equations

Number of k-mers per read = $l - k + 1$
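A minimal sketch of why a read of length $l$ yields $l - k + 1$ k-mers: sliding a window of width $k$ one base at a time gives one k-mer per start position. The function name `kmers` and the example read are hypothetical.

```python
def kmers(read, k):
    """Return every overlapping k-mer in `read`, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Valid start positions are 0 .. len(read) - k, i.e. l - k + 1 of them.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGTACGTAC"  # hypothetical read, l = 10
k = 4
found = kmers(read, k)
print(found[:3])   # ['ACGT', 'CGTA', 'GTAC']
print(len(found))  # 7, which equals l - k + 1 = 10 - 4 + 1
```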

Sequencing depth is the number of bases sequenced divided by the genome size, so genome size can be calculated as the number of bases sequenced divided by the depth.

$\large D = \frac{N_{base}}{G}$

$\large G = \frac{N_{base}}{D}$

And since $N_{base} = N_{reads} \cdot l$ (assuming all reads have the same length $l$),

$\large G = \frac{N_{reads} \cdot l}{D}$

The relationship between sequencing depth and k-mer depth is given by the [Velvet manual](https://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf), section 5.1 as:

$\large D' = \frac{D \cdot (l - k + 1)}{l}$

Briefly, the k-mer depth is the depth in bases multiplied by the number of k-mers per read and divided by the number of bases per read. In other words, this converts depth in bases to depth in k-mers.

Solving for $D$,

$\large D = \frac{D' \cdot l}{l - k + 1}$

Substituting this into $G = \frac{N_{reads} \cdot l}{D}$,

$\large G = N_{reads} \cdot l \cdot \frac{l - k + 1}{D' \cdot l}$

$\large G = \frac{N_{reads} \cdot (l - k + 1)}{D'}$

#### References

1. Jang-il Sohn, Jin-Wu Nam; The present and future of de novo whole-genome assembly, Briefings in Bioinformatics, bbw096, https://doi.org/10.1093/bib/bbw096