Skip to content

Alignment tree workflow

Ryan Wick edited this page May 30, 2023 · 3 revisions

This page described how to use Verticall to produce a distance-based tree. This workflow is best suited to collections of closely related isolates which are too large for Gubbins.

Some key points about the alignment tree workflow:

  • It requires a whole-genome pseudo-alignment, and is thus better suited to closely-related groups of isolates. If your isolates are diverse, consider using the distance tree workflow instead.
  • It is fast and scales well with the number of isolates: O(n). So this workflow is suitable for very large datasets containing thousands of isolates.
  • It filters out recombination in only one way: by painting the contigs and ignoring horizontal parts relative to a reference. This means it is not as sensitive as the distance tree workflow or Gubbins.

Requirements

  • An assembly for each of your genomes in FASTA format.
    • Put this in a single directory (the instructions below assume this directory is named assemblies).
    • Sample names are taken from the assembly filenames: e.g. Sample_123.fasta is good, assembly.fasta is bad.
    • The assemblies cannot contain ambiguous bases. You can use Verticall repair to split contigs at ambiguous bases if necessary.
    • Good assemblies (e.g. with a big N50) are better, but fragmented assemblies are okay.
  • A reference sequence in FASTA format.
    • The reference must contain only one sequence and cannot contain ambiguous bases.
  • A whole-genome pseudo-alignment in FASTA format.
    • This must include the reference sequence as well as each isolate.
    • The reference sequence name in the alignment must match the reference sequence filename. E.g. if your reference sequence filename is reference.fasta then the alignment must contain >reference.
    • The alignment must span the full reference sequence – i.e. both variant and invariant sites must be included.
    • If you have reads for your genomes, a clean.full.aln alignment file from Snippy works well.
    • If you don't have reads for your genomes, you can use the generate_ska_alignment.py script in Gubbins.

Step 1: pairwise comparisons

This Verticall pairwise command will perform pairwise comparisons of each assembly to the reference:

verticall pairwise -i assemblies -o verticall.tsv -r reference.fasta

Unlike in the distance tree workflow, this will not perform all pairwise comparisons between assemblies. This means it scales O(n) with the number of assemblies, so it should be fast even for large collections.

Step 2: masked alignment

This Verticall mask command will mask horizontal regions and unaligned regions in the pseudo-alignment:

verticall mask -i verticall.tsv -a alignment.fasta -o masked_alignment.fasta

If you want to exclude the reference genome from the masked aligned, use --exclude_reference. If you want to exclude invariant sites from the alignment (can make tree-building faster for big collections), use --exclude_invariant.

Step 3: tree

Here's a minimal tree-building command with IQ-TREE:

iqtree2 -s masked_alignment.fasta

RAxML-NG and FastTree are other options.