AutoTax2 is a vsearch-based workflow for SILVA-backed SSU rRNA sequence processing, taxonomy assignment, rank-wise clustering, reference extension, and source-overlap analysis.
Version: 0.1
AutoTax2 is designed around a fixed SILVA backbone.
- SILVA remains the backbone and is not re-clustered together with user sequences.
- User datasets and extension references are first inserted into the SILVA framework.
- Sequences that match SILVA above a rank-specific identity threshold inherit the corresponding SILVA-backed taxon assignment.
- Sequences that do not match SILVA above a threshold are clustered within the extension dataset and reported as novel rank-like taxa.
- VSEARCH performs the search and clustering work; Python handles file preparation, metadata parsing, workflow logic, and result summaries.
- AutoTax2 does not download SILVA automatically. Users provide local SILVA FASTA and metadata files.
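The threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not AutoTax2's code: the function name `split_ranks` is hypothetical, the threshold values are commonly cited SSU rRNA identity cutoffs rather than AutoTax2's actual defaults, and treating "above" as "greater than or equal to" is an assumption.

```python
# Illustrative rank thresholds only; NOT necessarily AutoTax2's defaults.
RANK_THRESHOLDS = {
    "phylum": 0.750,
    "class": 0.785,
    "order": 0.820,
    "family": 0.865,
    "genus": 0.945,
    "species": 0.987,
}


def split_ranks(best_identity, thresholds=RANK_THRESHOLDS):
    """Split ranks into those inherited from the best SILVA hit and those left as novel.

    best_identity is the fractional identity (0-1) of the best SILVA match.
    """
    inherited = [rank for rank, cutoff in thresholds.items() if best_identity >= cutoff]
    novel = [rank for rank, cutoff in thresholds.items() if best_identity < cutoff]
    return inherited, novel


inherited, novel = split_ranks(0.95)
print("inherit from SILVA:", inherited)
print("cluster as novel:", novel)
```

Under these assumed cutoffs, a sequence with a 95% best hit would take its SILVA taxonomy down to genus and would be clustered within the extension dataset at species level.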
- 📦 1. Installation
- 📋 2. Command overview
- 🗄️ 3. Prepare SILVA
- 🔍 4. Optional intron detection
- 🔗 5. Insert sequences into the SILVA backbone
- 📊 6. Multi-reference overlap
- ⚙️ 7. Classic helper workflow
- 📝 8. Input requirements
- 🛠️ 9. Development
AutoTax2 is a Python package. Install it into an existing Python environment.
gh repo clone ypchan/autotax2
python -m pip install -e .
VSEARCH must be available on PATH, or you must provide its path with --vsearch.
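To confirm from Python that a VSEARCH executable can be found, a small helper along these lines may be useful. It is a sketch that uses only the standard library; `resolve_vsearch` is a hypothetical name and is not part of AutoTax2.

```python
import shutil


def resolve_vsearch(explicit_path=None):
    """Return the path of a usable VSEARCH executable, or None if none is found.

    explicit_path plays the role of a user-supplied location (as with --vsearch);
    when it is omitted, the PATH is searched for "vsearch".
    """
    if explicit_path:
        return shutil.which(explicit_path)
    return shutil.which("vsearch")


found = resolve_vsearch()
print("vsearch found at:", found if found else "not found")
```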
Check the installation:
autotax2 --help
autotax2 check
vsearch --version
autotax2 --help
Available commands:
| Command | Description |
|---|---|
| check | Check external dependencies and optional reference files |
| prepare-silva | Prepare local SILVA FASTA and metadata files |
| detect-intron | Detect intron-like insertions and write analysis FASTA files |
| insert-backbone | Insert extension sequences into the SILVA backbone |
| overlap-backbone | Summarize taxon overlap across backbone assignment files |
| derep | Run VSEARCH dereplication |
| cluster | Run VSEARCH clustering at one or more identity levels |
| classify | Classify representative sequences using SINTAX plus SILVA/type-strain hits |
| assign | Assign new sequences to old centroids or create new clusters |
| provenance | Summarize source composition from UC clustering levels |
| summarize | Summarize existing classify outputs |
| run | Run dereplication, clustering and classification in one workflow |
Every command supports:
autotax2 <command> --help
autotax2 <command> --example
AutoTax2 expects local SILVA files. You can download them from the SILVA website.
Download the SILVA files:
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz.md5
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5
Verify the downloads:
md5sum -c SILVA_138.2_SSURef_Nr99.full_metadata.gz.md5
md5sum -c SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5
Required files:
- SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
- SILVA_138.2_SSURef_Nr99.full_metadata.gz
Prepare a local reference directory:
autotax2 prepare-silva \
--silva-fasta SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz \
--silva-metadata SILVA_138.2_SSURef_Nr99.full_metadata.gz \
--out refdatabases \
--make-udb \
--threads auto
Note:
prepare-silva cleans the input FASTA before downstream processing. The default behavior is strict:
- U/u is converted to T
- Only A, C, G, and T are allowed after conversion
- Records containing N, ambiguity codes, gaps, dots, or any other non-ACGT character are dropped
- A summary report and a dropped-record report are generated
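The strict cleaning rules above can be illustrated with a minimal Python sketch. This is not the prepare-silva implementation: `clean_sequence` is a hypothetical helper, and uppercasing lowercase a/c/g/t is an assumption, since only the U/u conversion is stated.

```python
VALID_BASES = set("ACGT")


def clean_sequence(sequence):
    """Apply the strict rules to one sequence.

    Returns (cleaned, None) when the record is kept,
    or (None, reason) when it would be dropped.
    """
    # U/u becomes T; uppercasing the remaining bases is an assumption of this sketch.
    converted = sequence.replace("U", "T").replace("u", "T").upper()
    offending = sorted(set(converted) - VALID_BASES)
    if offending:
        return None, "non-ACGT characters: " + "".join(offending)
    return converted, None


# Hypothetical records, keyed by sequence ID.
records = {"seq1": "ACGU", "seq2": "ACGNNT", "seq3": "AC-GT"}
kept, dropped = {}, {}
for seq_id, seq in records.items():
    cleaned, reason = clean_sequence(seq)
    if cleaned is None:
        dropped[seq_id] = reason
    else:
        kept[seq_id] = cleaned

print("kept:", kept)
print("dropped:", dropped)
```

The `kept` and `dropped` dictionaries loosely correspond to the cleaned FASTA and the dropped-record report, respectively.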
Output directory structure:
refdatabases/
├── <prefix>.fasta
├── <prefix>.udb
├── <prefix>_sintax.fasta
├── <prefix>_sintax.udb
├── <prefix>_typestrains.fasta
├── <prefix>_typestrains.udb
├── <prefix>_raw_input.fasta
├── <prefix>_fasta_cleaning_summary.tsv
├── <prefix>_fasta_cleaning_dropped.tsv
├── silva_taxonomy.tsv
├── typestrains_accessionIDs.txt
└── autotax2_ref_manifest.tsv
Use detect-intron when long-read 16S sequences may contain intron-like insertions.
autotax2 detect-intron \
--input hifimeta.original.fa \
--db refdatabases/<prefix>.udb \
--source-label hifimeta \
--out hifimeta_intron \
--search-id 0.70 \
--rescue-id 0.987 \
--min-intron-len 50 \
--min-flank-len 150 \
--threads auto
Important outputs:
- hifimeta_intron/analysis_sequences.fa - Analysis FASTA for matching and clustering
- hifimeta_intron/sequence_version_map.tsv - Sequence mapping file
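One way to read the two length parameters in the command above is that a candidate insertion must be long enough and must be flanked on both sides by enough sequence. The sketch below encodes that reading only; how detect-intron actually combines --min-intron-len and --min-flank-len is not described here, and `is_intron_like` and its coordinate convention are hypothetical.

```python
def is_intron_like(insert_start, insert_end, query_len,
                   min_intron_len=50, min_flank_len=150):
    """Decide whether an unaligned insertion in a query looks intron-like.

    insert_start and insert_end are 0-based, end-exclusive query coordinates
    of the insertion (an assumed convention for this sketch).
    """
    insert_len = insert_end - insert_start
    left_flank = insert_start
    right_flank = query_len - insert_end
    return (
        insert_len >= min_intron_len
        and left_flank >= min_flank_len
        and right_flank >= min_flank_len
    )


# A 400 bp insertion in the middle of a 1900 bp read passes both checks.
print(is_intron_like(700, 1100, 1900))
# A 30 bp insertion is too short to count.
print(is_intron_like(700, 730, 1530))
```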
With intron detection:
autotax2 insert-backbone \
--input hifimeta_intron/analysis_sequences.fa \
--original-fasta hifimeta.original.fa \
--version-map hifimeta_intron/sequence_version_map.tsv \
--source-label hifimeta \
--silva-manifest refdatabases/autotax2_ref_manifest.tsv \
--rank-thresholds default \
--db-format auto \
--out hifimeta_inserted \
--threads auto
Without intron detection:
autotax2 insert-backbone \
--input hifimeta.original.fa \
--original-fasta hifimeta.original.fa \
--source-label hifimeta \
--silva-manifest refdatabases/autotax2_ref_manifest.tsv \
--rank-thresholds default \
--db-format auto \
--out hifimeta_inserted \
--threads auto
Common outputs:
hifimeta_inserted/
├── sequence_rank_assignment.tsv
├── rank_taxa_summary.tsv
├── rank_uc/
├── rank_centroids_core/
└── rank_centroids_original/
Run insert-backbone for each reference dataset, then compare assignments:
autotax2 overlap-backbone \
--assignments \
ref2_inserted/sequence_rank_assignment.tsv \
ref3_inserted/sequence_rank_assignment.tsv \
ref4_inserted/sequence_rank_assignment.tsv \
hifimeta_inserted/sequence_rank_assignment.tsv \
--labels ref2,ref3,ref4,hifimeta \
--out ref_overlap
Typical outputs:
ref_overlap/
├── taxon_presence_by_source.tsv
├── taxon_count_by_source.tsv
├── source_pairwise_overlap_by_rank.tsv
└── source_unique_taxa_by_rank.tsv
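The kind of comparison summarized in these tables can be sketched with plain Python sets. The example below is an assumption-laden illustration: the taxon names and the output field names are made up, the Jaccard index is just one reasonable overlap measure, and none of it reflects the actual column layout of the overlap-backbone outputs.

```python
from itertools import combinations


def pairwise_overlap(taxa_by_source):
    """Count shared taxa and compute a Jaccard index for every pair of sources."""
    rows = []
    for a, b in combinations(sorted(taxa_by_source), 2):
        shared = taxa_by_source[a] & taxa_by_source[b]
        union = taxa_by_source[a] | taxa_by_source[b]
        rows.append({
            "source_a": a,
            "source_b": b,
            "shared": len(shared),
            "jaccard": len(shared) / len(union) if union else 0.0,
        })
    return rows


def unique_taxa(taxa_by_source):
    """Return, for each source, the taxa that no other source contains."""
    result = {}
    for source, taxa in taxa_by_source.items():
        others = set().union(
            *(t for s, t in taxa_by_source.items() if s != source)
        )
        result[source] = taxa - others
    return result


# Hypothetical genus-level taxa per source label.
genus_level = {
    "ref2": {"Lactobacillus", "Bacillus"},
    "ref3": {"Bacillus", "Escherichia"},
    "hifimeta": {"Bacillus", "novel_genus_1"},
}

for row in pairwise_overlap(genus_level):
    print(row)
print(unique_taxa(genus_level))
```

Repeating this per rank would give one overlap table per taxonomic level.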
Dereplicate:
autotax2 derep \
--input input.fa \
--out work \
--sort \
--threads auto
Cluster:
autotax2 cluster \
--input work/derep_sorted.fa \
--out clusters \
--ids 0.99,0.97 \
--threads auto
Classify:
autotax2 classify \
--input clusters/otu099_centroids.fa \
--ref-manifest refdatabases/autotax2_ref_manifest.tsv \
--out classify \
--threads auto
Or run end to end:
autotax2 run \
--input input.fa \
--ref-manifest refdatabases/autotax2_ref_manifest.tsv \
--out autotax2_out \
--ids 0.99,0.97,0.95,0.90 \
--threads auto
SILVA FASTA headers should contain an accession followed by semicolon-separated taxonomy:
>AB000001.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus;
ACGT...
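A header in this format can be split into its accession and taxonomy levels as follows. This is a minimal sketch with a hypothetical helper name, not the parser AutoTax2 uses.

```python
def parse_silva_header(header):
    """Split a SILVA-style FASTA header into (accession, list of taxonomy levels)."""
    line = header.lstrip(">").strip()
    # The first whitespace-delimited token is the accession; the rest is taxonomy.
    accession, _, taxonomy = line.partition(" ")
    levels = [level.strip() for level in taxonomy.split(";") if level.strip()]
    return accession, levels


accession, levels = parse_silva_header(
    ">AB000001.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;"
    "Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus;"
)
print(accession)
print(levels)
```

Filtering out empty strings handles the trailing semicolon, so the example yields seven levels.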
The metadata table must contain at least:
acc
flags
Accessions whose flags column contains [T] or [t] are extracted as type-strain references.
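The type-strain rule can be sketched with the standard csv module. The sketch assumes a tab-delimited table with a header row that names the acc and flags columns; the delimiter and the helper name `type_strain_accessions` are assumptions, not confirmed details of the SILVA metadata file or of AutoTax2.

```python
import csv
import io


def type_strain_accessions(handle):
    """Return accessions whose flags column contains [T] or [t]."""
    reader = csv.DictReader(handle, delimiter="\t")  # tab delimiter is assumed
    accessions = []
    for row in reader:
        flags = row.get("flags") or ""
        if "[T]" in flags or "[t]" in flags:
            accessions.append(row["acc"])
    return accessions


# Tiny in-memory table with made-up accessions and flags.
example = io.StringIO(
    "acc\tflags\n"
    "AB000001\t[T]\n"
    "AB000002\t\n"
    "AB000003\t[t][C]\n"
)
print(type_strain_accessions(example))
```

For a gzip-compressed metadata file, the same function could be given a handle opened with `gzip.open(path, "rt")`.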
For user-supplied FASTA files, the first token of each header is treated as the sequence ID:
>seq000001 optional description
ACGT...
Install in editable mode and run tests:
python -m pip install -e .
python -m pip install pytest
pytest
Check syntax quickly:
python -m py_compile autotax2/cli.py autotax2/prepare.py