findGSE

findGSE is a tool, written in R (code), for estimating the size of (heterozygous diploid or homozygous) genomes by iteratively fitting k-mer frequencies with a skew normal distribution model. The current version works on Linux and Mac OS X with R version 3.3.1 or above.

To use findGSE, you need to provide a k value and a corresponding k-mer histogram (.histo) file generated from short reads. The file contains two tab-separated columns: the first gives the frequency at which k-mers occur in the reads, and the second gives the number of distinct k-mers that occur at that frequency (example).
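As a quick sanity check before running findGSE, the two-column layout described above can be verified with a small shell sketch. The helper name `check_histo` and the toy file `toy.histo` are hypothetical and not part of findGSE. The sketch splits on any whitespace (so tab-separated files pass) and only checks that every line holds two non-negative integers.

```shell
# check_histo: hypothetical helper (not part of findGSE).
# Succeeds only if every line has exactly two whitespace-separated
# fields and both fields are non-negative integers.
check_histo() {
  awk '
    NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { bad = 1; exit }
    END { exit bad + 0 }
  ' "$1"
}

# Toy histogram: frequency <TAB> number of distinct k-mers at that frequency.
printf '1\t500\n2\t120\n3\t40\n' > toy.histo

if check_histo toy.histo; then
  echo "toy.histo: format OK"
else
  echo "toy.histo: unexpected format"
fi
```

A file that fails this check (for example, one with a header line or a third column) is worth inspecting before it is passed to findGSE.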

Given multiple fastq.gz files, here is a two-step example for counting k-mers with jellyfish:

  zcat *.fastq.gz | jellyfish count /dev/fd/0 -C -o test_21mer -m 21 -t 1 -s 5G
  jellyfish histo -h 3000000 -o test_21mer.histo test_21mer

Once you have the .histo file, and assuming findGSE has been installed (INSTALL), run the genome size estimation (GSE) in R as follows:

  library("findGSE")
  findGSE(histo="test_21mer.histo", sizek=21, outdir="hom_test_21mer")

The result is printed in the form "Genome size estimate for test_21mer.histo: 1498918 bp." For more details about the estimation, check the .txt and .pdf files in the output directory.
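If the printed message has been captured (for example, in a log file), the estimate can be extracted for downstream scripts. The following is a minimal sketch that assumes the message has exactly the format quoted above; the variable names `msg` and `size_bp` are illustrative only.

```shell
# Example message, copied from the format quoted above.
msg='Genome size estimate for test_21mer.histo: 1498918 bp.'

# Extract the integer that sits between the last ": " and " bp."
# (assumes the message format is exactly as quoted).
size_bp=$(printf '%s\n' "$msg" | sed -n 's/.*: \([0-9][0-9]*\) bp\.$/\1/p')

echo "Estimated genome size: ${size_bp} bp"
```

If the message does not match the assumed format, `size_bp` is left empty, which a calling script can test for.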

Two detailed toy examples of GSE, one for a heterozygous genome and one for a homozygous genome, are provided for experimentation.
