Skip to content
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
75 lines (58 sloc) 3.82 KB

Metagenomic species profiling

This script will map metagenomic reads to a database of phylogenetic marker genes using HS-BLASTN. Mapped reads are used estimate the read depth and relative abundance of 5,952 bacterial species. Reads are mapped according to gene-specific, species-level mapping thresholds (94.5-98% DNA identity). Reads that map equally well to 2 or more species are probabalistically assigned.

This module is required for downstream modules (i.e. genes and snps) if you plan to automatically identify abundant species with no prior knowlege of the community. If you already have a list of species ids that you're interested, this step can be skipped.


Usage: species outdir [options]

positional arguments:
  outdir             Path to directory to store results. 
                     Name should correspond to unique sample identifier.

optional arguments:
  -h, --help         show this help message and exit
  -1 M1              FASTA/FASTQ file containing 1st mate if using paired-end reads.
                     Otherwise FASTA/FASTQ containing unpaired reads.
                     Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
  -2 M2              FASTA/FASTQ file containing 2nd mate if using paired-end reads.
                     Can be gzip'ed (extension: .gz) or bzip2'ed (extension: .bz2)
  -n MAX_READS       Number of reads to use from input file(s) (use all)
  -t THREADS         Number of threads to use for database search (1)
  -d DB              Path to reference database
                     By default, the MIDAS_DB environmental variable is used
  --remove_temp      Remove temporary files, including BLAST output.
                     Useful for reducing disk space of MIDAS output
  --word_size INT    Word size for BLAST search (28)
                     Use word sizes > 16 for greatest efficiency.
  --mapid FLOAT      Discard reads with alignment identity < MAPID
                     By default gene-specific species-level cutoffs are used
                     Values between 0-100 accepted
  --aln_cov FLOAT    Discard reads with alignment coverage < ALN_COV (0.75)
                     Values between 0-1 accepted
  --read_length INT  Trim reads to READ_LENGTH and discard reads with length < READ_LENGTH
                     By default, reads are not trimmed or filtered


  1. run with defaults using a paired-end metagenome: species /path/to/outdir -1 /path/to/reads_1.fq.gz -2 /path/to/reads_2.fq.gz

  2. run using a single-end metagenome with 4 CPUs and only 4M reads: species /path/to/outdir -1 /path/to/reads_1.fq.gz -t 4 -n 4000000

  3. run with exactly 80 base-pair reads: species /path/to/outdir -1 /path/to/reads_1.fq.gz --read_length 80

Output files

species_profile.txt: tab-delimited with header; each line contains the abundance values for 1 species (5,952 total species) sorted by decreasing relative abundance.

log.txt: log file containing parameters used

temp: directory of intermediate files; run with --remove_temp to remove these files

Output formats


  • species_id: species identifier
  • count_reads: number of reads mapped to marker genes
  • coverage: estimated genome-coverage (i.e. read-depth) of species in metagenome
  • relative_abundance: estimated relative abundance of species in metagenome

Memory usage

  • < 1.5 Gb for most samples


  • ~5,000 reads/second for 100-bp reads when using default parameters
  • Use -n and -t to increase throughput
  • Note than using -n will result in underestimates of species genome-coverage in the full metagenome
  • We found that about 1 million reads was sufficient to precisely estimate species relative abundance for a gut community

Next step

Merge species abundance across samples

You can’t perform that action at this time.