FAQ and miscellaneous tips

Ryan Wick edited this page Aug 21, 2023 · 12 revisions

Should I use the distance tree workflow or alignment tree workflow?

Short answer: if your dataset is very large and very closely related, then the alignment tree workflow is probably best. Otherwise, I'd recommend the distance tree workflow.

Advantages to the distance tree workflow:

  • Usually does a better job of filtering out the effect of recombination (in my experience).
  • Suitable for very diverse datasets (e.g. spanning a genus).

Advantages to the alignment tree workflow:

  • Faster, especially for large datasets.
  • Allows for robust alignment-based tree inference, e.g. with IQ-TREE.

Should I use Verticall or Gubbins?

If you have a dataset that is a good fit for Gubbins (very closely related and not too many genomes), then Gubbins will probably deliver better results. Gubbins is sensitive to small regions of recombination (e.g. 100 bp) whereas Verticall's sliding-window approach means that it will only identify larger regions of recombination.

Verticall is the better choice for datasets outside Gubbins' niche: those too large for Gubbins (use the alignment tree workflow) or too diverse for Gubbins (use the distance tree workflow).

Can I use Verticall on viral genomes?

I'm not sure (haven't tried), but probably not. In order to build its distance distribution (see Pairwise assembly comparison), Verticall needs a large number of sliding windows, and I fear that most viral genomes are too short to provide enough. You're welcome to try, but be prepared to fiddle with parameters, and use Verticall view to sanity-check its behaviour.

Do Verticall results depend on the genome alignment parameters?

I haven't explicitly tested this, but I would guess not much. I tried to choose minimap2 parameters that work for a wide range of scenarios: indexing with -k15 -w10 and aligning with -x asm20. But you're free to use the --index_options and --align_options settings to experiment with different minimap2 parameters.

Fine-tuning parameters

I've tried to set up the Verticall pairwise parameters to work well for a broad range of bacterial genomes, but you are free to play with these parameters yourself. In particular, you can adjust --window_count/--window_size to change Verticall's sliding window size and --smoothing_factor to change how much it smooths the distribution (see Pairwise assembly comparison for details).

If you go down the parameter-tuning road, I recommend using Verticall view to visualise the effect of your settings.

Adding genomes to an existing pairwise analysis

Imagine you've done a large pairwise analysis (e.g. 500 genomes, 249500 pairwise comparisons)...

verticall pairwise -i assemblies -o verticall.tsv
verticall matrix -i verticall.tsv -o verticall.phylip
fastme --method B --nni B --spr -i verticall.phylip -o verticall.newick

...and you now have 10 more genomes you'd like to add. You don't have to redo the entire analysis with all 510 genomes (259590 pairwise comparisons), which might take a while. Instead, you can use the --existing_tsv option to tell Verticall which assembly pairs to skip, and it will only do the analysis for the new assembly pairs (10090 pairwise comparisons).
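The comparison counts quoted above all fit n × (n − 1), which suggests each ordered pair of assemblies is counted once. This formula is inferred from the numbers on this page, not taken from the Verticall source. A minimal sketch for checking the arithmetic with your own genome counts (pair_count is a hypothetical helper, not part of Verticall):

```shell
# pair_count N: number of ordered assembly pairs for N genomes, i.e. N * (N - 1).
# Assumption: this formula is inferred from the counts quoted on this page.
pair_count() {
    echo $(( $1 * ($1 - 1) ))
}

existing=$(pair_count 500)   # comparisons already done: 249500
combined=$(pair_count 510)   # comparisons for the full 510-genome set: 259590

# Only the difference needs to be computed when 10 genomes are added.
echo "new comparisons: $(( combined - existing ))"   # prints "new comparisons: 10090"
```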

To do this, add the 10 new assemblies to your assemblies directory, and then:

verticall pairwise -i assemblies -o verticall_new.tsv --existing_tsv verticall.tsv
tail -n+2 verticall_new.tsv >> verticall.tsv  # combine the TSV files excluding the second file's header
rm verticall_new.tsv
verticall matrix -i verticall.tsv -o verticall.phylip
fastme --method B --nni B --spr -i verticall.phylip -o verticall.newick
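After combining the TSV files, it may be worth confirming that the merged file looks complete before building the matrix. The sketch below is a hypothetical helper (check_tsv is not part of Verticall). It assumes the TSV has exactly one header line and one row per ordered assembly pair, so N genomes should give N × (N − 1) data rows, and it flags any header line that was appended by mistake:

```shell
# check_tsv FILE N: succeed only if FILE has N * (N - 1) data rows after its
# header and no repeated copy of the header line among those rows.
# Assumption: one row per ordered assembly pair (inferred from the counts above).
check_tsv() {
    awk -v n="$2" '
        NR == 1 { header = $0; next }   # remember the header line
        $0 == header { dup++ }          # a stray header appended from another file
        { rows++ }                      # every line after the first counts as a row
        END {
            expected = n * (n - 1)
            printf "rows=%d expected=%d duplicate_headers=%d\n", rows, expected, dup
            exit !(rows + 0 == expected && dup + 0 == 0)
        }
    ' "$1"
}
```

For the example above, this would be run as check_tsv verticall.tsv 510 after the tail step. A non-zero exit status means the row count or header check failed. If some pairs legitimately produce no row in your data, the expected count will not match and this check is not applicable as written.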