Coalescent-Aware Species Tree Length Estimation in Substitution-units

ytabatabaee/CASTLES

CASTLES

CASTLES is a method for estimating the branch lengths of a given species tree from estimated gene trees, in units of the expected number of substitutions per sequence site (substitution units, SU). It accounts for gene tree heterogeneity due to incomplete lineage sorting (ILS), as modeled by the multi-species coalescent (MSC) model.

The CASTLES algorithm is described in the following paper:

Y. Tabatabaee, C. Zhang, T. Warnow, S. Mirarab, Phylogenomic branch length estimation using quartets, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i185–i193, https://doi.org/10.1093/bioinformatics/btad221

Datasets and results from this study are available in the CASTLES-paper repository.

Integrated inside ASTER (NEW)

An improved version of CASTLES that also handles gene duplication and loss (called CASTLES-Pro) is now integrated into the species tree estimation software ASTER. We recommend using ASTER directly to obtain SU branch lengths on trees produced by ASTRAL or ASTRAL-Pro, or on a fixed input species tree topology.

Usage Instructions

Follow the installation instructions in the ASTER repository and download ASTER (>= v1.16.2.4). When compiled as shown below, ASTRAL and ASTRAL-Pro produce species trees with SU branch lengths by default. See the ASTRAL tutorial and the ASTRAL-Pro tutorial for more information.

Compilation

For single-copy gene trees, compile ASTER with the following command:

$ g++ -D ASTRALIV -std=gnu++11 -march=native -Ofast -pthread src/astral.cpp -o bin/astral4

For multi-copy gene trees, compile ASTER with the following command:

$ g++ -D CASTLES -std=gnu++11 -march=native -Ofast -pthread src/astral-pro.cpp -o bin/astral-pro2

Estimating branch lengths

Arguments

Required
 -i,  --input        input gene trees in newick format       
 -o,  --output       output species tree with SU branch lengths
Optional
      --genelength   average gene sequence length [default: 1000]
      --root         outgroup name
 -a,  --mapping      list of gene name to taxon name maps
 -c,  --constraint   species tree to score 

To infer a species tree using ASTRAL with SU branch lengths, use the following command:

$ bin/astral4 -i <gene-tree-path> -o <output-path> [--root <outgroup-name>] [--genelength <gene-length>]

To infer branch lengths on a fixed input species tree topology, use the scoring option -C -c <species-tree-path>:

$ bin/astral4 -i <gene-tree-path> -C -c <species-tree-path> -o <output-path> [--root <outgroup-name>] [--genelength <gene-length>]

To infer branch lengths using ASTRAL-Pro, use the following command:

$ bin/astral-pro2 -i <gene-tree-path> [-C -c <species-tree-path>] -o <output-path> [--root <outgroup-name>] [--genelength <gene-length>]

If an outgroup is known, we recommend specifying it with the --root option. Additionally, if the average length of the gene sequences used to infer the input gene trees is approximately known, set it with --genelength. The default value for this parameter is 1000.
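
If the gene alignments are still on hand, the value for --genelength can be estimated from them. The following is a minimal sketch, not part of CASTLES or ASTER: avg_gene_length is a hypothetical helper, and it assumes each file is an aligned FASTA file, so the length of the first record equals the alignment length. The file names in the usage comment are placeholders.

```shell
# Hypothetical helper (not shipped with ASTER): mean alignment length,
# rounded to the nearest integer, over the FASTA files given as arguments.
# Assumes aligned FASTA input, so only the first record of each file is measured.
avg_gene_length() {
    for f in "$@"; do
        # Sum the sequence lines of the first record, stop at the second record.
        awk '/^>/ { n++; next }
             n == 1 { len += length($0) }
             n > 1  { exit }
             END    { print len + 0 }' "$f"
    done | awk '{ sum += $1 } END { if (NR > 0) print int(sum / NR + 0.5) }'
}

# Example usage (placeholder paths):
#   len=$(avg_gene_length alignments/*.fasta)
#   bin/astral4 -i genes.tre -o species.tre --genelength "$len"
```

Only a rough average is needed here, since the text above asks for an approximate value.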

Handling multiple individuals per species

When there are multiple individuals per species and the individual names do not match the species names, use the following command:

$ bin/astral4 -i <gene-tree-path> [-C -c <species-tree-path>] -a <name_map> -o <output-path>

where the name_map file maps individual names to species names, one pair per line, in the following format:

individual_name1    species_name1
individual_name2    species_name2
individual_name3    species_name3
...
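
When individual names follow a regular pattern, the name_map file can be generated rather than written by hand. The sketch below is an illustration under stated assumptions, not part of ASTER: make_name_map is a hypothetical helper, it assumes every individual is named <species>_<suffix> (for example Hsap_1), and it assumes a tab is an acceptable separator between the two columns.

```shell
# Hypothetical helper (not shipped with ASTER): read individual names from
# stdin, one per line, and print "individual<TAB>species" pairs.
# Assumes the species name is everything before the last underscore;
# a name without an underscore is mapped to itself.
make_name_map() {
    awk 'NF { sp = $1; sub(/_[^_]*$/, "", sp); print $1 "\t" sp }'
}

# Example usage (placeholder file names):
#   make_name_map < individuals.txt > name_map.txt
#   bin/astral4 -i genes.tre -a name_map.txt -o species.tre
```

If the naming convention differs, only the pattern passed to sub needs to change.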

Additional Files

  • Older documentation for CASTLES is available here.
  • A modified version of the simulation software SimPhy that produces species trees with SU branch lengths is available in simulation_files.
  • Some useful scripts for comparing branch lengths of two species trees are available in scripts.
  • See su_branch_calcs.nb for theoretical derivations of SU branch lengths under the MSC.