Strain/species identification in metagenomes using genome-specific markers


GSMer identifies genome-specific markers (GSMs) from currently sequenced microbial genomes using a k-mer based approach. The identified GSMs can be used to detect microbial strains/species in metagenomes, especially in the human microbiome, where many reference genomes are available. Two levels of GSMs are currently supported: strain-specific and species-specific.
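
The k-mer idea can be illustrated with a toy sketch: collect the k-mers of a target genome and discard any that also occur, on either strand, in the other genomes. This is only an illustration of the concept, not GSMer's implementation (GSMer relies on meryl/mapMers and BLAST); the function names `kmers`, `revcomp`, and `specific_kmers` and the tiny sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def revcomp(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def specific_kmers(target, others, k):
    """Return k-mers of target that occur in none of the other genomes.

    Both strands of each non-target genome are considered.
    """
    background = set()
    for genome in others:
        background |= kmers(genome, k)
        background |= kmers(revcomp(genome), k)
    return kmers(target, k) - background


# Toy data: "ACGT" is shared with the non-target sequence and is dropped.
markers = specific_kmers("ACGTTGCA", ["ACGTAAAA"], k=4)
print(sorted(markers))
```

Real genomes need a disk-backed k-mer counter rather than in-memory sets, which is why a dedicated tool is used for this in practice.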

#### Citation:
##### Qichao Tu, Zhili He and Jizhong Zhou. “Strain/Species identification in metagenomes using genome-specific markers.” Nucleic Acids Research 2014; doi: 10.1093/nar/gku138


Identified GSMs

Species-specific GSMs: 2,005 species (4,933 strains).

Strain-specific GSMs: 4,088 strains

Full list of included microbial strains:


Third party programs:

* NCBI BLAST (megablast+formatdb)
* KMER (meryl+mapMers)

Perl libraries:

* Bio::SeqIO (bioperl)
* Getopt::Long
* Parallel::ForkManager
* String::Random


This tutorial shows how to identify GSMs for E. coli O157 using the E. coli K-12 genome as the alien (non-target) genome.
GSM identification takes seven steps, which must be run in order. For details of all available options, please run perl -m help.

Testing files:

* O157.gbk: E. coli O157 genomes in GenBank format. Four different strains are included (i.e. O157:H7 EC4115, O157:H7 EDL933, O157:H7 TW14359, and O157:H7 Sakai).
* k12.fa: E. coli K-12 substr. W3110 genome in FASTA format. This genome is used as the alien genome, so the resulting O157-specific GSMs will not be found in the K-12 genome.

Steps to run:

0. Check the file, and set the tax level to 1, which represents strain level. Make sure all other program paths and GSM criteria are correct.
1. `perl -m splitgbk -i O157.gbk` This step splits the O157.gbk file into four gbk files representing the four O157 strains. The four gbk files are generated in a gbk directory, and a strain.list file is generated in the working directory.
2. `perl -m makeblastdb -f1 k12.fa` This step creates a BLAST database from the four gbk files in the gbk directory and the K-12 genome.
3. `perl -m makekmerdb -f1 k12.fa` This step creates a k-mer database for all O157 genomes and the K-12 genome. K-mers that show up in >=2 O157 genomes and all k-mers in the K-12 genome are extracted for k-mer database construction.
4. `perl -m getgsm` This step generates all candidate GSMs for the O157 strains.
5. `perl -m mapgsm` This step maps the candidate GSMs to the k-mer database for continuous stretch filtering.
6. `perl -m blastgsm` This step runs a BLAST search of the unmapped GSMs against the BLAST database.
7. `perl -m checkspecificity` This step filters GSMs based on the BLAST output for continuous stretch and identity with non-target genomes. Four *.out files containing detailed information on the O157-specific GSMs are generated in the GSM output directory specified in the file.
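
The logic behind steps 3 and 5 can be sketched in a few lines of Python. This is a rough illustration under simplifying assumptions (in-memory sets, forward strand only, toy sequences), not a description of how GSMer itself is implemented; `shared_kmer_db`, `passes_stretch_filter`, and the `min_genomes` parameter are hypothetical names chosen for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all k-mers in seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_db(targets, alien, k, min_genomes=2):
    """Build a k-mer database as described in step 3.

    Keeps k-mers found in at least min_genomes target genomes,
    plus every k-mer of the alien genome.
    """
    counts = Counter()
    for genome in targets:
        # kmers() returns a set, so each genome contributes at most 1 per k-mer.
        counts.update(kmers(genome, k))
    db = {kmer for kmer, n in counts.items() if n >= min_genomes}
    db |= kmers(alien, k)
    return db


def passes_stretch_filter(candidate, db, k):
    """Return True if the candidate shares no k-long stretch with the database."""
    return kmers(candidate, k).isdisjoint(db)


# Toy sequences standing in for three target genomes and one alien genome.
targets = ["AAACCCGGG", "TTTCCCGGA", "ACGTACGTA"]
alien = "GGGTTTAAA"
db = shared_kmer_db(targets, alien, k=4)

print(passes_stretch_filter("ACGTACG", db, k=4))  # no shared 4-mer
print(passes_stretch_filter("TCCCGA", db, k=4))   # contains CCCG, shared by two targets
```

Candidates that pass a filter of this kind correspond to the "unmapped" GSMs that step 6 then checks by BLAST, where sequence identity with non-target genomes is also taken into account.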