🧬 AutoTax2

AutoTax2 is a VSEARCH-based workflow for SILVA-backed SSU rRNA sequence processing, taxonomy assignment, rank-wise clustering, reference extension, and source-overlap analysis.

Version: 0.1

🎯 Design goals

AutoTax2 is designed around a fixed SILVA backbone.

SILVA remains the backbone and is not re-clustered together with user sequences.
User datasets and extension references are first inserted into the SILVA framework.
Sequences that match SILVA above a rank-specific identity threshold inherit the corresponding SILVA-backed taxon assignment.
Sequences that do not match SILVA above the threshold for a given rank are clustered within the extension dataset and reported as novel rank-like taxa.
VSEARCH performs the search and clustering work; Python handles file preparation, metadata parsing, workflow logic, and result summaries.
AutoTax2 does not download SILVA automatically. Users provide local SILVA FASTA and metadata files.
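
The rank-wise decision described above can be sketched roughly as follows. This is a minimal illustration, not AutoTax2's actual code: the function name `assign_ranks`, the status labels, and the threshold values are all assumptions made for the example (the values behind `--rank-thresholds default` are not stated in this README), and "above" is treated here as greater than or equal to.

```python
# Illustrative thresholds only (identity as a fraction). These are assumed
# values for the sketch, not the documented AutoTax2 defaults.
RANK_THRESHOLDS = {
    "phylum": 0.750,
    "class": 0.785,
    "order": 0.820,
    "family": 0.865,
    "genus": 0.945,
    "species": 0.987,
}


def assign_ranks(best_identity, silva_lineage, thresholds=RANK_THRESHOLDS):
    """Return {rank: (taxon, status)} for one query sequence.

    best_identity: identity (0-1) of the best SILVA hit, or None if no hit.
    silva_lineage: {rank: taxon name} of that SILVA hit.
    """
    result = {}
    for rank, threshold in thresholds.items():
        if best_identity is not None and best_identity >= threshold:
            # Close enough at this rank: inherit the SILVA-backed taxon.
            result[rank] = (silva_lineage.get(rank), "silva")
        else:
            # Too distant at this rank: left for clustering within the
            # extension dataset as a candidate novel taxon.
            result[rank] = (None, "novel_candidate")
    return result


lineage = {"phylum": "Firmicutes", "genus": "Lactobacillus"}
print(assign_ranks(0.96, lineage)["genus"])
print(assign_ranks(0.96, lineage)["species"])
```

Under these assumed thresholds, a query at 96% identity would inherit the genus of its best hit but would remain a candidate novel taxon at species level.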

📚 Table of Contents

📦 1. Installation
📋 2. Command overview
🗄️ 3. Prepare SILVA
🔍 4. Optional intron detection
🔗 5. Insert sequences into the SILVA backbone
📊 6. Multi-reference overlap
⚙️ 7. Classic helper workflow
📝 8. Input requirements
🛠️ 9. Development

📦 1. Installation

AutoTax2 is a Python package. Clone the repository and install it into an existing Python environment:

gh repo clone ypchan/autotax2
cd autotax2
python -m pip install -e .

VSEARCH must be available on PATH, or you must provide its path with --vsearch.

Check the installation:

autotax2 --help
autotax2 check
vsearch --version
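
If you drive AutoTax2 from your own scripts, the same PATH-or-explicit-path rule can be checked up front. The helper below is a hypothetical sketch (the name `resolve_vsearch` is not part of AutoTax2); it relies only on the standard library's `shutil.which`.

```python
import shutil


def resolve_vsearch(explicit_path=None):
    """Return an executable path for VSEARCH, or None if it cannot be found.

    explicit_path plays the role of a value you would pass to --vsearch;
    when it is omitted, the lookup falls back to searching PATH.
    """
    return shutil.which(explicit_path or "vsearch")


path = resolve_vsearch()
print(path if path else "vsearch not found; pass its location with --vsearch")
```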

📋 2. Command overview

autotax2 --help

Available commands:

Command	Description
`check`	Check external dependencies and optional reference files
`prepare-silva`	Prepare local SILVA FASTA and metadata files
`detect-intron`	Detect intron-like insertions and write analysis FASTA files
`insert-backbone`	Insert extension sequences into the SILVA backbone
`overlap-backbone`	Summarize taxon overlap across backbone assignment files
`derep`	Run VSEARCH dereplication
`cluster`	Run VSEARCH clustering at one or more identity levels
`classify`	Classify representative sequences using SINTAX plus SILVA/type-strain hits
`assign`	Assign new sequences to old centroids or create new clusters
`provenance`	Summarize source composition from UC clustering levels
`summarize`	Summarize existing classify outputs
`run`	Run dereplication, clustering and classification in one workflow

Every command supports:

autotax2 <command> --help
autotax2 <command> --example

🗄️ 3. Prepare SILVA

AutoTax2 expects local SILVA files, which you can download from the SILVA website:

wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz.md5
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5

Verify the downloads:

md5sum -c SILVA_138.2_SSURef_Nr99.full_metadata.gz.md5
md5sum -c SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5

Required files:

SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
SILVA_138.2_SSURef_Nr99.full_metadata.gz

Prepare a local reference directory:

autotax2 prepare-silva \
  --silva-fasta SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz \
  --silva-metadata SILVA_138.2_SSURef_Nr99.full_metadata.gz \
  --out refdatabases \
  --make-udb \
  --threads auto

Note: prepare-silva cleans the input FASTA before downstream processing. The default behavior is strict:

U/u is converted to T.
Only A, C, G, and T are allowed after conversion.
Records containing N, ambiguity codes, gaps, dots, or any other non-ACGT character are dropped.
A summary report and a dropped-record report are generated.
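
The strict cleaning rules above amount to something like the following sketch. It is an approximation for illustration, not the code in prepare-silva: the function name `clean_fasta_records` is hypothetical, and uppercasing the sequence before validation is an assumption.

```python
def clean_fasta_records(records):
    """Split (header, sequence) pairs into kept and dropped lists.

    Dropped entries carry the offending characters so they can be written
    to a dropped-record report.
    """
    kept, dropped = [], []
    for header, seq in records:
        # Convert U/u to T, then uppercase (uppercasing is an assumption).
        converted = seq.replace("U", "T").replace("u", "T").upper()
        bad = sorted(set(converted) - set("ACGT"))
        if bad or not converted:
            dropped.append((header, "".join(bad) or "empty"))
        else:
            kept.append((header, converted))
    return kept, dropped


records = [("a", "ACGU"), ("b", "ACGN"), ("c", "ac-gu")]
kept, dropped = clean_fasta_records(records)
print(kept)
print(dropped)
```

Dropping whole records rather than masking individual characters keeps every retained reference strictly ACGT, at the cost of losing sequences that contain a single ambiguous base.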

Output directory structure:

refdatabases/
  ├── <prefix>.fasta
  ├── <prefix>.udb
  ├── <prefix>_sintax.fasta
  ├── <prefix>_sintax.udb
  ├── <prefix>_typestrains.fasta
  ├── <prefix>_typestrains.udb
  ├── <prefix>_raw_input.fasta
  ├── <prefix>_fasta_cleaning_summary.tsv
  ├── <prefix>_fasta_cleaning_dropped.tsv
  ├── silva_taxonomy.tsv
  ├── typestrains_accessionIDs.txt
  └── autotax2_ref_manifest.tsv

🔍 4. Optional intron detection

Use this when long-read 16S sequences may contain intron-like insertions.

autotax2 detect-intron \
  --input hifimeta.original.fa \
  --db refdatabases/<prefix>.udb \
  --source-label hifimeta \
  --out hifimeta_intron \
  --search-id 0.70 \
  --rescue-id 0.987 \
  --min-intron-len 50 \
  --min-flank-len 150 \
  --threads auto

Important outputs:

hifimeta_intron/analysis_sequences.fa - Analysis FASTA for matching and clustering
hifimeta_intron/sequence_version_map.tsv - Sequence mapping file

🔗 5. Insert sequences into the SILVA backbone

With intron detection:

autotax2 insert-backbone \
  --input hifimeta_intron/analysis_sequences.fa \
  --original-fasta hifimeta.original.fa \
  --version-map hifimeta_intron/sequence_version_map.tsv \
  --source-label hifimeta \
  --silva-manifest refdatabases/autotax2_ref_manifest.tsv \
  --rank-thresholds default \
  --db-format auto \
  --out hifimeta_inserted \
  --threads auto

Without intron detection:

autotax2 insert-backbone \
  --input hifimeta.original.fa \
  --original-fasta hifimeta.original.fa \
  --source-label hifimeta \
  --silva-manifest refdatabases/autotax2_ref_manifest.tsv \
  --rank-thresholds default \
  --db-format auto \
  --out hifimeta_inserted \
  --threads auto

Common outputs:

hifimeta_inserted/
  ├── sequence_rank_assignment.tsv
  ├── rank_taxa_summary.tsv
  ├── rank_uc/
  ├── rank_centroids_core/
  └── rank_centroids_original/

📊 6. Multi-reference overlap

Run insert-backbone for each reference dataset, then compare assignments:

autotax2 overlap-backbone \
  --assignments \
    ref2_inserted/sequence_rank_assignment.tsv \
    ref3_inserted/sequence_rank_assignment.tsv \
    ref4_inserted/sequence_rank_assignment.tsv \
    hifimeta_inserted/sequence_rank_assignment.tsv \
  --labels ref2,ref3,ref4,hifimeta \
  --out ref_overlap

Typical outputs:

ref_overlap/
  ├── taxon_presence_by_source.tsv
  ├── taxon_count_by_source.tsv
  ├── source_pairwise_overlap_by_rank.tsv
  └── source_unique_taxa_by_rank.tsv
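
To give a feel for what the pairwise and unique-taxa tables summarize, here is a small sketch over plain Python sets. It is only an assumption about the general idea: the input structure, the function names, and the use of a Jaccard index are illustrative choices, not a description of how overlap-backbone computes or formats its outputs.

```python
from itertools import combinations


def pairwise_overlap(taxa_by_source):
    """taxa_by_source: {source label: set of taxon names at one rank}.

    Returns (source_a, source_b, shared_count, jaccard) for every pair.
    """
    rows = []
    for a, b in combinations(sorted(taxa_by_source), 2):
        shared = taxa_by_source[a] & taxa_by_source[b]
        union = taxa_by_source[a] | taxa_by_source[b]
        jaccard = len(shared) / len(union) if union else 0.0
        rows.append((a, b, len(shared), jaccard))
    return rows


def unique_taxa(taxa_by_source):
    """Return, per source, the taxa that no other source contains."""
    unique = {}
    for source, taxa in taxa_by_source.items():
        others = set().union(
            *(t for s, t in taxa_by_source.items() if s != source)
        )
        unique[source] = taxa - others
    return unique


genus_sets = {"ref2": {"A", "B"}, "hifimeta": {"B", "C"}}
print(pairwise_overlap(genus_sets))
print(unique_taxa(genus_sets))
```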

⚙️ 7. Classic helper workflow

Dereplicate:

autotax2 derep \
  --input input.fa \
  --out work \
  --sort \
  --threads auto

Cluster:

autotax2 cluster \
  --input work/derep_sorted.fa \
  --out clusters \
  --ids 0.99,0.97 \
  --threads auto

Classify:

autotax2 classify \
  --input clusters/otu099_centroids.fa \
  --ref-manifest refdatabases/autotax2_ref_manifest.tsv \
  --out classify \
  --threads auto

Or run end to end:

autotax2 run \
  --input input.fa \
  --ref-manifest refdatabases/autotax2_ref_manifest.tsv \
  --out autotax2_out \
  --ids 0.99,0.97,0.95,0.90 \
  --threads auto

📝 8. Input requirements

8.1 SILVA FASTA

SILVA FASTA headers should contain an accession followed by semicolon-separated taxonomy:

>AB000001.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus;
ACGT...
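
A header in this layout can be split as in the sketch below. The function name `parse_silva_header` is hypothetical and the code only illustrates the expected format; it is not taken from AutoTax2.

```python
def parse_silva_header(header):
    """Split a SILVA FASTA header into (accession, [taxonomy levels])."""
    text = header.lstrip(">").strip()
    # The accession is everything up to the first space.
    accession, _, taxonomy = text.partition(" ")
    # Drop the empty field produced by the trailing semicolon.
    levels = [level.strip() for level in taxonomy.split(";") if level.strip()]
    return accession, levels


example = (
    ">AB000001.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;"
    "Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus;"
)
accession, levels = parse_silva_header(example)
print(accession)
print(levels)
```

Splitting on the first space only matters because species names such as "Lactobacillus acidophilus" contain spaces themselves.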

8.2 SILVA metadata

The metadata table must contain at least:

acc
flags

Accessions whose flags column contains [T] or [t] are extracted as type-strain references.
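
The type-strain rule can be pictured with the sketch below. It assumes a tab-separated table with a header row, which is an assumption about the metadata layout, and the function name `type_strain_accessions` is hypothetical.

```python
import csv
import io


def type_strain_accessions(metadata_handle):
    """Return accessions whose flags column contains [T] or [t]."""
    # Assumes a tab-separated table with a header row.
    reader = csv.DictReader(
        metadata_handle, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    accessions = []
    for row in reader:
        flags = row.get("flags") or ""
        if "[T]" in flags or "[t]" in flags:
            accessions.append(row["acc"])
    return accessions


# Tiny made-up table, for illustration only.
table = io.StringIO("acc\tflags\nAB1\t[T]\nAB2\t\nAB3\tx [t]\n")
print(type_strain_accessions(table))
```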

8.3 User FASTA

The first token of each header is treated as the sequence ID:

>seq000001 optional description
ACGT...
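
In other words, anything after the first whitespace in a header is ignored when IDs are derived. A minimal sketch of that rule, with a hypothetical helper name:

```python
def read_fasta_ids(lines):
    """Return the sequence ID (first header token) of each FASTA record."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            ids.append(tokens[0] if tokens else "")
    return ids


print(read_fasta_ids([">seq000001 optional description\n", "ACGT\n"]))
```

Because only the first token is used, two headers that differ only in their descriptions would resolve to the same ID.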

🛠️ 9. Development

Install in editable mode and run tests:

python -m pip install -e .
python -m pip install pytest
pytest

Check syntax quickly:

python -m py_compile autotax2/cli.py autotax2/prepare.py
