Skip to content

ypchan/autotax2

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🧬 AutoTax2

AutoTax2 is a vsearch-based workflow for SILVA-backed SSU rRNA sequence processing, taxonomy assignment, rank-wise clustering, reference extension, and source-overlap analysis.

Version: 0.1

🎯 Design goals

AutoTax2 is designed around a fixed SILVA backbone.

  1. SILVA remains the backbone and is not re-clustered together with user sequences.
  2. User datasets and extension references are first inserted into the SILVA framework.
  3. Sequences that match SILVA above a rank-specific identity threshold inherit the corresponding SILVA-backed taxon assignment.
  4. Sequences that do not match SILVA above a threshold are clustered within the extension dataset and reported as novel rank-like taxa.
  5. VSEARCH performs the search and clustering work; Python handles file preparation, metadata parsing, workflow logic, and result summaries.
  6. AutoTax2 does not download SILVA automatically. Users provide local SILVA FASTA and metadata files.

📚 Table of Contents

📦 1. Installation

AutoTax2 is a Python package. Install it into an existing Python environment.

gh repo clone ypchan/autotax2
python -m pip install -e .

VSEARCH must be available on PATH, or you must provide its path with --vsearch.

Check the installation:

autotax2 --help
autotax2 check
vsearch --version

📋 2. Command overview

autotax2 --help

Available commands:

Command Description
check Check external dependencies and optional reference files
prepare-silva Prepare local SILVA FASTA and metadata files
detect-intron Detect intron-like insertions and write analysis FASTA files
insert-backbone Insert extension sequences into the SILVA backbone
overlap-backbone Summarize taxon overlap across backbone assignment files
derep Run VSEARCH dereplication
cluster Run VSEARCH clustering at one or more identity levels
classify Classify representative sequences using SINTAX plus SILVA/type-strain hits
assign Assign new sequences to old centroids or create new clusters
provenance Summarize source composition from UC clustering levels
summarize Summarize existing classify outputs
run Run dereplication, clustering and classification in one workflow

Every command supports:

autotax2 <command> --help
autotax2 <command> --example

🗄️ 3. Prepare SILVA

AutoTax2 expects local SILVA files. You can download them from the SILVA website.

Download the SILVA files:

wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/full_metadata/SILVA_138.2_SSURef_Nr99.full_metadata.gz.md5
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
wget https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5

Verify the downloads:

md5sum -c SILVA_138.2_SSURef.full_metadata.gz.md5
md5sum -c SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz.md5

Required files:

  • SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz
  • SILVA_138.2_SSURef.full_metadata.gz

Prepare a local reference directory:

autotax2 prepare-silva \
  --silva-fasta SILVA_138.2_SSURef_NR99_tax_silva.fasta.gz \
  --silva-metadata SILVA_138.2_SSURef.full_metadata.gz \
  --out refdatabases \
  --make-udb \
  --threads auto

Note: prepare-silva cleans the input FASTA before downstream processing. The default behavior is strict:

  • U/u is converted to T
  • Only A/C/G/T is allowed after conversion
  • Records containing N, ambiguity codes, gaps, dots, or any other non-ACGT character are dropped
  • A summary report and a dropped-record report are generated

Output directory structure:

refdatabases/
  ├── <prefix>.fasta
  ├── <prefix>.udb
  ├── <prefix>_sintax.fasta
  ├── <prefix>_sintax.udb
  ├── <prefix>_typestrains.fasta
  ├── <prefix>_typestrains.udb
  ├── <prefix>_raw_input.fasta
  ├── <prefix>_fasta_cleaning_summary.tsv
  ├── <prefix>_fasta_cleaning_dropped.tsv
  ├── silva_taxonomy.tsv
  ├── typestrains_accessionIDs.txt
  └── autotax2_ref_manifest.tsv

🔍 4. Optional intron detection

Use this when long-read 16S sequences may contain intron-like insertions.

autotax2 detect-intron \
  --input hifimeta.original.fa \
  --db refdatabases/<prefix>.udb \
  --source-label hifimeta \
  --out hifimeta_intron \
  --search-id 0.70 \
  --rescue-id 0.987 \
  --min-intron-len 50 \
  --min-flank-len 150 \
  --threads auto

Important outputs:

  • hifimeta_intron/analysis_sequences.fa - Analysis FASTA for matching and clustering
  • hifimeta_intron/sequence_version_map.tsv - Sequence mapping file

🔗 5. Insert sequences into the SILVA backbone

With intron detection:

autotax2 insert-backbone \
  --input hifimeta_intron/analysis_sequences.fa \
  --original-fasta hifimeta.original.fa \
  --version-map hifimeta_intron/sequence_version_map.tsv \
  --source-label hifimeta \
  --silva-manifest refdatabases/autotax2_ref_manifest.tsv \
  --rank-thresholds default \
  --db-format auto \
  --out hifimeta_inserted \
  --threads auto

Without intron detection:

autotax2 insert-backbone \
  --input hifimeta.original.fa \
  --original-fasta hifimeta.original.fa \
  --source-label hifimeta \
  --silva-manifest refdatabases/autotax2_ref_manifest.tsv \
  --rank-thresholds default \
  --db-format auto \
  --out hifimeta_inserted \
  --threads auto

Common outputs:

hifimeta_inserted/
  ├── sequence_rank_assignment.tsv
  ├── rank_taxa_summary.tsv
  ├── rank_uc/
  ├── rank_centroids_core/
  └── rank_centroids_original/

📊 6. Multi-reference overlap

Run insert-backbone for each reference dataset, then compare assignments:

autotax2 overlap-backbone \
  --assignments \
    ref2_inserted/sequence_rank_assignment.tsv \
    ref3_inserted/sequence_rank_assignment.tsv \
    ref4_inserted/sequence_rank_assignment.tsv \
    hifimeta_inserted/sequence_rank_assignment.tsv \
  --labels ref2,ref3,ref4,hifimeta \
  --out ref_overlap

Typical outputs:

ref_overlap/
  ├── taxon_presence_by_source.tsv
  ├── taxon_count_by_source.tsv
  ├── source_pairwise_overlap_by_rank.tsv
  └── source_unique_taxa_by_rank.tsv

⚙️ 7. Classic helper workflow

Dereplicate:

autotax2 derep \
  --input input.fa \
  --out work \
  --sort \
  --threads auto

Cluster:

autotax2 cluster \
  --input work/derep_sorted.fa \
  --out clusters \
  --ids 0.99,0.97 \
  --threads auto

Classify:

autotax2 classify \
  --input clusters/otu099_centroids.fa \
  --ref-manifest refdatabases/autotax2_ref_manifest.tsv \
  --out classify \
  --threads auto

Or run end to end:

autotax2 run \
  --input input.fa \
  --ref-manifest refdatabases/autotax2_ref_manifest.tsv \
  --out autotax2_out \
  --ids 0.99,0.97,0.95,0.90 \
  --threads auto

📝 8. Input requirements

8.1 SILVA FASTA

SILVA FASTA headers should contain an accession followed by semicolon-separated taxonomy:

>AB000001.1.1500 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus;
ACGT...

8.2 SILVA metadata

The metadata table must contain at least:

acc
flags

Accessions whose flags column contains [T] or [t] are extracted as type-strain references.

8.3 User FASTA

The first token of each header is treated as the sequence ID:

>seq000001 optional description
ACGT...

🛠️ 9. Development

Install in editable mode and run tests:

python -m pip install -e .
python -m pip install pytest
pytest

Check syntax quickly:

python -m py_compile autotax2/cli.py autotax2/prepare.py

About

AutoTax2 is a VSEARCH-based, SILVA-backbone-guided workflow for near/full-length rRNA classification, rank-aware clustering, intron-aware sequence rescue, and reference overlap analysis.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages