Themis: a robust and accurate species-level metagenomic profiler
Themis is a fast and robust metagenomic profiler that achieves high accuracy across ultra-low to high sequencing depths. Themis combines a rapid, high-recall pre-screening step with graph-based refinement using colored de Bruijn graphs, reducing classification ambiguity and improving scalability to large reference databases.
conda create -n themis_env
conda activate themis_env
conda install -c bioconda -c conda-forge themis
## Run themis.
themis -h
- Commands
  - themis build-custom: Build custom themis databases.
  - themis profile: Profile reads against custom databases.
- 1. Build a custom reference database
themis build-custom --input-file input_genomes.txt --taxonomy-files nodes.dmp names.dmp --db-prefix themisDB --level strain -t $threads -k 19 -w 51
input_genomes.txt is a headerless, tab-separated manifest where each line contains (1) the absolute path to a genome FASTA file, (2) its strain_name, and (3) the corresponding strain-level NCBI taxid.
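Because a malformed manifest may only surface as an error well into a long build, it can help to check the file first. The following is a minimal sketch, not part of Themis: the function name `check_manifest` and the example rows (paths, strain names, taxids) are hypothetical, and the only rules checked are the ones stated above (three tab-separated columns, an absolute path, and a numeric taxid).

```python
import io
import os


def check_manifest(handle):
    """Return a list of problems found in a headerless, 3-column TSV manifest."""
    problems = []
    for lineno, line in enumerate(handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # ignore blank lines
        if len(fields) != 3:
            problems.append(f"line {lineno}: expected 3 columns, found {len(fields)}")
            continue
        path, strain_name, taxid = fields
        if not os.path.isabs(path):
            problems.append(f"line {lineno}: path is not absolute: {path}")
        if not strain_name.strip():
            problems.append(f"line {lineno}: empty strain_name")
        if not taxid.isdigit():
            problems.append(f"line {lineno}: taxid is not an integer: {taxid}")
    return problems


# Hypothetical example rows; the paths, names, and taxids are made up.
example = io.StringIO(
    "/data/genomes/strainA.fna\tstrainA\t1001\n"
    "genomes/strainB.fna\tstrainB\t1002\n"
    "/data/genomes/strainC.fna\tstrainC\n"
)
problems = check_manifest(example)
for problem in problems:
    print(problem)
```

To check a real file, pass an open file handle instead of the in-memory example, for instance `check_manifest(open("input_genomes.txt"))`.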
Because the reference pangenome we used for testing is large, we provide the genomes_info.txt used to build it. You need to download the listed genomes from NCBI RefSeq and update the paths in genomes_info.txt to match their actual locations. NCBI RefSeq is updated periodically, so we cannot guarantee that every listed genome is still available. Building the reference pangenome from this genomes_info.txt takes approximately one week.
- 2. Profile
# short reads (paired-end)
themis -r read1.fq -r read2.fq --db-prefix themisDB --ref-info genomes_info.txt --out themis_query --threads 64 -k 31
# long reads
themis -r reads.fq --single --db-prefix themisDB --ref-info genomes_info.txt --out themis_query --threads 64 -k 31
genomes_info.txt is a tab-separated metadata table with a header line. The columns are, in order: strain_name, strain_taxid, species_taxid, species_name, and genome_path, where strain_name and strain_taxid must be unique and genome_path gives the absolute path to the corresponding genome FASTA file.
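The constraints on genomes_info.txt (five columns, unique strain_name and strain_taxid, absolute genome_path) can be checked before profiling. The sketch below is an illustration under those stated rules only. The function name `check_genomes_info` and the example rows are hypothetical, and the exact header text is not checked because the description above only fixes the column order.

```python
import io
import os


def check_genomes_info(handle):
    """Return a list of problems found in a 5-column TSV with one header line."""
    problems = []
    seen_names, seen_taxids = set(), set()
    next(handle, None)  # skip the header line
    for lineno, line in enumerate(handle, start=2):
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # ignore blank lines
        if len(fields) != 5:
            problems.append(f"line {lineno}: expected 5 columns, found {len(fields)}")
            continue
        strain_name, strain_taxid, species_taxid, species_name, genome_path = fields
        if strain_name in seen_names:
            problems.append(f"line {lineno}: duplicate strain_name {strain_name}")
        if strain_taxid in seen_taxids:
            problems.append(f"line {lineno}: duplicate strain_taxid {strain_taxid}")
        seen_names.add(strain_name)
        seen_taxids.add(strain_taxid)
        if not os.path.isabs(genome_path):
            problems.append(f"line {lineno}: genome_path is not absolute: {genome_path}")
    return problems


# Hypothetical example table; all names, taxids, and paths are made up.
example = io.StringIO(
    "strain_name\tstrain_taxid\tspecies_taxid\tspecies_name\tgenome_path\n"
    "strainA\t1001\t100\tSpecies one\t/data/genomes/strainA.fna\n"
    "strainB\t1002\t100\tSpecies one\t/data/genomes/strainB.fna\n"
    "strainC\t1002\t200\tSpecies two\tgenomes/strainC.fna\n"
)
info_problems = check_genomes_info(example)
for problem in info_problems:
    print(problem)
```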
Output file: species_abundance.txt
Taxonomic_ID Relative_Abundance
12345 0.0012
...
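For downstream analysis, species_abundance.txt can be loaded into a dictionary keyed by taxid. This is a minimal sketch that assumes the layout shown above: one header line followed by two whitespace-separated columns. The function name `read_species_abundance` and the abundance values in the example are hypothetical.

```python
import io


def read_species_abundance(handle):
    """Map each Taxonomic_ID to its Relative_Abundance."""
    abundances = {}
    next(handle, None)  # skip the header line
    for line in handle:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank lines and placeholders such as "..."
        taxid, value = fields
        abundances[int(taxid)] = float(value)
    return abundances


# Hypothetical example content; the abundance values are made up.
example = io.StringIO(
    "Taxonomic_ID\tRelative_Abundance\n"
    "12345\t0.25\n"
    "69\t0.75\n"
)
abundances = read_species_abundance(example)

# Rank species from most to least abundant.
ranked = sorted(abundances.items(), key=lambda item: item[1], reverse=True)
for taxid, value in ranked:
    print(taxid, value)
```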
Output file: tax_profile.tre
no rank 131567 1|131567 cellular organisms 1.0000000000000000
superkingdom 2 1|131567|2 Bacteria 1.0000000000000000
phylum 1224 1|131567|2|1224 Pseudomonadota 0.37078199931442135
class 1236 1|131567|2|1224|1236 Gammaproteobacteria 0.30406830971011906
order 135614 1|131567|2|1224|1236|135614 Xanthomonadales 0.006280138828970609
family 32033 1|131567|2|1224|1236|135614|32033 Xanthomonadaceae 0.006280138828970609
genus 68 1|131567|2|1224|1236|135614|32033|68 Lysobacter 0.006280138828970609
species 69 1|131567|2|1224|1236|135614|32033|68|69 Lysobacter enzymogenes 0.006280138828970609
...
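Each line of tax_profile.tre appears to hold five fields: rank, taxid, the "|"-joined lineage of taxids, the taxon name, and the relative abundance. The sketch below parses that layout so the profile can be filtered by rank. It assumes the fields are tab-separated, which the listing above does not confirm (ranks and names such as "no rank" and "cellular organisms" contain spaces, so splitting on whitespace would not work). The names `TaxonRecord` and `read_tax_profile` are hypothetical.

```python
import io
from typing import List, NamedTuple


class TaxonRecord(NamedTuple):
    rank: str
    taxid: int
    lineage: List[int]  # taxids from the root down to this taxon
    name: str
    abundance: float


def read_tax_profile(handle):
    """Parse lines of the form: rank, taxid, lineage, name, abundance (tab-separated)."""
    records = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        rank, taxid, lineage, name, abundance = fields
        records.append(
            TaxonRecord(
                rank=rank,
                taxid=int(taxid),
                lineage=[int(t) for t in lineage.split("|")],
                name=name,
                abundance=float(abundance),
            )
        )
    return records


# Three rows taken from the listing above, joined with tabs (an assumption).
example = io.StringIO(
    "phylum\t1224\t1|131567|2|1224\tPseudomonadota\t0.37078199931442135\n"
    "genus\t68\t1|131567|2|1224|1236|135614|32033|68\tLysobacter\t0.006280138828970609\n"
    "species\t69\t1|131567|2|1224|1236|135614|32033|68|69\tLysobacter enzymogenes\t0.006280138828970609\n"
)
records = read_tax_profile(example)

# Keep only species-level rows.
species = [record for record in records if record.rank == "species"]
for record in species:
    print(record.taxid, record.name, record.abundance)
```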