
Themis is a fast and robust metagenomic profiler that achieves high accuracy across ultra-low to high sequencing depths.


xujialupaoli/Themis

 
 


Themis

Themis: a robust and accurate species-level metagenomic profiler.



Overview

Themis is a fast and robust metagenomic profiler that achieves high accuracy across ultra-low to high sequencing depths. Themis combines a rapid, high-recall pre-screening step with graph-based refinement using colored de Bruijn graphs, reducing classification ambiguity and improving scalability to large reference databases.

Installation

conda create -n themis_env
conda activate themis_env
conda install -c bioconda -c conda-forge themis
# Run themis and print the help message to confirm the installation.
themis -h

Features

  • Commands
    • themis build-custom: build custom Themis databases.
    • themis profile: profile reads against custom databases.

Quick start

  • 1. Build a custom reference database
themis build-custom --input-file input_genomes.txt --taxonomy-files nodes.dmp names.dmp --db-prefix themisDB --level strain -t $threads -k 19 -w 51

input_genomes.txt is a headerless, tab-separated manifest where each line contains (1) the absolute path to a genome FASTA file, (2) its strain_name, and (3) the corresponding strain-level NCBI taxid.
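As a minimal sketch of the manifest layout described above, the following Python helper writes a headerless, tab-separated file with the three fields in the stated order. The function names `format_manifest_line` and `write_manifest` are hypothetical and are not part of Themis; the checks (absolute path, integer taxid, no tabs in the strain name) are assumptions drawn only from the description above.

```python
import os


def format_manifest_line(fasta_path, strain_name, strain_taxid):
    """Return one manifest line: absolute FASTA path, strain name, strain taxid."""
    if not os.path.isabs(fasta_path):
        raise ValueError(f"genome path must be absolute: {fasta_path}")
    if "\t" in strain_name:
        raise ValueError(f"strain name must not contain a tab: {strain_name!r}")
    # int() rejects non-numeric taxids early instead of writing a bad line.
    return f"{fasta_path}\t{strain_name}\t{int(strain_taxid)}"


def write_manifest(records, out_path):
    """Write (fasta_path, strain_name, strain_taxid) tuples without a header."""
    with open(out_path, "w") as handle:
        for fasta_path, strain_name, strain_taxid in records:
            line = format_manifest_line(fasta_path, strain_name, strain_taxid)
            handle.write(line + "\n")
```

Calling `write_manifest(records, "input_genomes.txt")` with a list of tuples would produce a file in the layout expected by `--input-file`, under the assumptions stated above.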

Because the reference pangenome we used for testing is very large, we provide the genomes_info.txt used here instead of the database itself. To rebuild it, download the listed genomes from NCBI RefSeq and replace the paths in genomes_info.txt with their actual locations on your system. NCBI RefSeq is updated periodically, so we cannot guarantee that every listed genome is still available. Building the reference pangenome from this genomes_info.txt takes approximately one week.

  • 2. Profile
# short reads (paired-end)
themis -r read1.fq -r read2.fq --db-prefix themisDB --ref-info genomes_info.txt --out themis_query --threads 64 -k 31
# long reads
themis -r reads.fq --single --db-prefix themisDB --ref-info genomes_info.txt --out themis_query --threads 64 -k 31

genomes_info.txt is a tab-separated metadata table with a header line. The columns are, in order: strain_name, strain_taxid, species_taxid, species_name, and genome_path, where strain_name and strain_taxid must be unique and genome_path gives the absolute path to the corresponding genome FASTA file.
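The constraints on genomes_info.txt can be checked before a long run. The sketch below is illustrative only: `manifest_rows` is a hypothetical helper, not a Themis function. It reads the columns by position (so it does not depend on the exact header wording), enforces the uniqueness of strain_name and strain_taxid, and returns rows reordered as (genome_path, strain_name, strain_taxid), which is the field order given for input_genomes.txt.

```python
import csv
import io


def manifest_rows(handle):
    """Validate a genomes_info.txt stream and return (path, name, taxid) rows."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    seen_names, seen_taxids, rows = set(), set(), []
    for lineno, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 5:
            raise ValueError(f"line {lineno}: expected 5 columns, got {len(row)}")
        strain_name, strain_taxid, _species_taxid, _species_name, genome_path = row
        if strain_name in seen_names:
            raise ValueError(f"line {lineno}: duplicate strain_name {strain_name!r}")
        if strain_taxid in seen_taxids:
            raise ValueError(f"line {lineno}: duplicate strain_taxid {strain_taxid!r}")
        seen_names.add(strain_name)
        seen_taxids.add(strain_taxid)
        rows.append((genome_path, strain_name, strain_taxid))
    return rows


# In-memory demonstration with made-up values (not real taxids or paths).
demo = io.StringIO(
    "strain_name\tstrain_taxid\tspecies_taxid\tspecies_name\tgenome_path\n"
    "A1\t1001\t100\tSpecies a\t/genomes/a1.fna\n"
)
demo_rows = manifest_rows(demo)
```

Reading by position is a deliberate choice here: the description fixes the column order, while the literal header strings are an assumption.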

Output file: species_abundance.txt

Taxonomic_ID    Relative_Abundance
12345           0.0012
...
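A small reader for this two-column table might look like the sketch below. It assumes whitespace-separated fields (tabs or spaces) and a numeric Taxonomic_ID, so the header line and any non-data lines are skipped; `read_species_abundance` is a hypothetical name, not part of Themis.

```python
def read_species_abundance(lines):
    """Map taxonomic ID (int) to relative abundance (float)."""
    abundance = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        taxid, value = fields
        if not taxid.isdigit():
            continue  # skip the header line
        abundance[int(taxid)] = float(value)
    return abundance


def top_taxa(abundance, n=10):
    """Return the n most abundant (taxid, abundance) pairs, highest first."""
    return sorted(abundance.items(), key=lambda item: item[1], reverse=True)[:n]
```

With a real file, `read_species_abundance(open("species_abundance.txt"))` would return the full mapping, and `top_taxa` gives a quick view of the dominant species.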

Output file: tax_profile.tre

no rank 131567  1|131567        cellular organisms      1.0000000000000000
superkingdom    2       1|131567|2      Bacteria        1.0000000000000000
phylum  1224    1|131567|2|1224 Pseudomonadota  0.37078199931442135
class   1236    1|131567|2|1224|1236    Gammaproteobacteria     0.30406830971011906
order   135614  1|131567|2|1224|1236|135614     Xanthomonadales 0.006280138828970609
family  32033   1|131567|2|1224|1236|135614|32033       Xanthomonadaceae        0.006280138828970609
genus   68      1|131567|2|1224|1236|135614|32033|68    Lysobacter      0.006280138828970609
species 69      1|131567|2|1224|1236|135614|32033|68|69 Lysobacter enzymogenes  0.006280138828970609
...
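The rows above appear to carry five fields: rank, taxid, a "|"-separated lineage of taxids, the taxon name, and the relative abundance. The sketch below parses them under the assumption that the fields are tab-separated in the actual file (ranks such as "no rank" and names such as "cellular organisms" contain spaces, so splitting on spaces would not work). The field names and the `TaxNode` type are hypothetical labels chosen for illustration.

```python
from typing import NamedTuple, Tuple


class TaxNode(NamedTuple):
    rank: str
    taxid: int
    lineage: Tuple[int, ...]
    name: str
    abundance: float


def parse_tax_profile(lines):
    """Parse tab-separated tax_profile.tre lines into TaxNode records."""
    nodes = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue  # skip blank or non-data lines
        rank, taxid, lineage, name, abundance = fields
        path = tuple(int(t) for t in lineage.split("|"))
        nodes.append(TaxNode(rank, int(taxid), path, name, float(abundance)))
    return nodes


def at_rank(nodes, rank):
    """Return only the records at the given rank, e.g. 'species'."""
    return [node for node in nodes if node.rank == rank]
```

For example, `at_rank(parse_tax_profile(open("tax_profile.tre")), "genus")` would list the genus-level entries, assuming the file follows the layout shown.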
