
Distance tree workflow

Ryan Wick edited this page May 30, 2023 · 8 revisions

This page describes how to use Verticall to produce a distance-based tree. This workflow is best suited to collections of isolates that are too diverse for Gubbins.

Some key points about the distance tree workflow:

  • It works with both closely related groups of isolates and highly diverse isolates. However, distance-based trees are not ideal for very closely related groups (e.g. tens of SNPs between isolates).
  • It scales with the square of the number of isolates: O(n²). Small collections should be fast, but large collections (e.g. thousands of isolates) will take a lot of time and/or CPUs.
  • It filters out recombination in two ways: 1) by painting the alignments and ignoring horizontal parts and 2) by using the median sliding-window distance which is robust to outliers. This makes it more sensitive than the alignment tree workflow.
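
To get a feel for the quadratic scaling mentioned above, the following sketch counts the unordered pairs among n assemblies, n(n-1)/2. The function name pair_count is made up for this illustration, and whether Verticall processes each pair once or in both directions is not stated on this page, so treat the numbers as an order-of-magnitude guide only.

```shell
# Hypothetical helper: number of unordered pairs among n assemblies.
pair_count() {
    local n="$1"
    echo $(( n * (n - 1) / 2 ))
}

# Example usage: compare a small and a large collection.
pair_count 100     # 4950
pair_count 1000    # 499500
```

A tenfold increase in the number of assemblies gives roughly a hundredfold increase in comparisons, which is why large collections benefit from the parallelisation described in Step 1.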

Requirements

  • An assembly for each of your genomes in FASTA format.
    • Put these in a single directory (the instructions below assume this directory is named assemblies).
    • Sample names are taken from the assembly filenames: e.g. Sample_123.fasta is good, assembly.fasta is bad.
    • The assemblies cannot contain ambiguous bases. You can use Verticall repair to split contigs at ambiguous bases if necessary.
    • Good assemblies (e.g. with a big N50) are better, but fragmented assemblies are okay.
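
As a quick pre-flight check for the ambiguous-base requirement above, something like the following could list the files that contain characters other than A, C, G and T in their sequence lines. This is only a sketch: the function name find_ambiguous is made up here, and it assumes uncompressed files with a .fasta extension, so adjust the glob if your files are named differently.

```shell
# Hypothetical helper: print each .fasta file in a directory whose
# sequence lines contain anything other than A, C, G or T.
find_ambiguous() {
    local dir="$1" f
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue    # skip the literal glob when nothing matches
        # Drop header lines and carriage returns, then look for non-ACGT characters.
        if grep -v '^>' "$f" | tr -d '\r' | grep -q '[^ACGTacgt]'; then
            echo "$f"
        fi
    done
}

# Example usage:
#   find_ambiguous assemblies
```

Any files it lists are candidates for Verticall repair before running the pairwise step.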

Step 1: pairwise comparisons

This command will perform all-vs-all pairwise comparisons of the assemblies with Verticall pairwise:

verticall pairwise -i assemblies -o verticall.tsv

This scales O(n²) with the number of assemblies, so it may take a long time for large collections. If you have a big computing cluster, you can parallelise the work with the --part option like this:

# First run this to completion to ensure all alignment indices are built:
verticall pairwise -i assemblies -o verticall.tsv --index_only

# Then launch jobs (job_scheduler is a placeholder for your cluster's submission command):
for i in {001..100}; do
    job_scheduler "verticall pairwise -i assemblies -o verticall_${i}.tsv --part ${i}/100 --skip_check"
done
done

# When the jobs are finished, merge the results together:
cat verticall_*.tsv > verticall.tsv
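
Before merging, it may be worth confirming that every part actually produced an output file, since a failed job would otherwise go unnoticed. A minimal sketch, assuming the three-digit, zero-padded file naming used in the loop above (the function name check_parts is made up for this example):

```shell
# Hypothetical helper: confirm that <prefix>001.tsv .. <prefix>NNN.tsv all exist.
check_parts() {
    local prefix="$1" total="$2" i=1 missing=0 part
    while [ "$i" -le "$total" ]; do
        part=$(printf '%03d' "$i")
        if [ ! -e "${prefix}${part}.tsv" ]; then
            echo "missing: ${prefix}${part}.tsv" >&2
            missing=$((missing + 1))
        fi
        i=$((i + 1))
    done
    # Succeed only when no part is missing.
    [ "$missing" -eq 0 ]
}

# Example usage: only merge when all 100 parts are present.
#   check_parts verticall_ 100 && cat verticall_*.tsv > verticall.tsv
```

This only checks that the files exist, not that each job ran to completion, so it is still worth looking at your scheduler's logs.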

Step 2: distance matrix

This Verticall matrix command will produce a PHYLIP distance matrix from the TSV file:

verticall matrix -i verticall.tsv -o verticall.phylip
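
A quick sanity check on the resulting matrix can catch samples that went missing along the way. The sketch below assumes the common PHYLIP distance-matrix layout, where the first line holds the number of taxa and each following non-blank line is one row of the matrix; the function name check_phylip is made up for this example.

```shell
# Hypothetical helper: compare the taxon count declared on the first line
# of a PHYLIP matrix with the number of non-blank rows that follow it.
check_phylip() {
    local file="$1" declared rows
    declared=$(head -n 1 "$file" | tr -d '[:space:]')
    rows=$(tail -n +2 "$file" | grep -c '[^[:space:]]')
    echo "declared taxa: $declared, matrix rows: $rows"
    [ "$declared" = "$rows" ]
}

# Example usage:
#   check_phylip verticall.phylip
```

If the declared count is lower than the number of assemblies you started with, some samples may have been lost in an earlier step.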

Step 3: tree

Here's a minimal tree-building command with FastME:

fastme -i verticall.phylip -o verticall.newick

See Distance based tree methods for more info on building trees from a distance matrix.