Ryan Wick edited this page Aug 9, 2022 · 34 revisions

Trycycler

The problem

Long-read assembly has come a long way in the last few years, and there are many good assemblers available, including Canu, Flye, Raven and Redbean. Since bacterial genomes are relatively simple (not too large and not too many repeats), a completed assembly (one contig per replicon) is often possible when assembling long reads.

But even the best assemblers are not perfect! They often fail to circularise sequences, either duplicating or omitting sequence at the start/end of a contig. They sometimes produce spurious contigs, e.g. assembling a repetitive part of the chromosome into a separate contig. They sometimes omit entire replicons, e.g. failing to include a plasmid. They sometimes create medium-scale indel errors, e.g. deleting 50 bp from the genome. And they occasionally create large-scale misassemblies, e.g. a significant structural rearrangement. Check out our paper comparing long-read assemblers for a more in-depth look at how they perform.

So imagine that you've done long-read sequencing of a bacterial isolate and assembled the reads. The result looks like a nice completed assembly (e.g. a big circular contig for the chromosome and a couple of smaller circular contigs for plasmids), but how can you be sure that it's free from the kinds of problems listed above?

The solution

Trycycler is a tool that takes as input multiple separate long-read assemblies of the same genome (e.g. from different assemblers or different read subsets) and produces a consensus long-read assembly.

In brief, Trycycler does the following:

  • Clusters the contig sequences, so the user can distinguish complete contigs (i.e. those that correspond to an entire replicon) from spurious and/or incomplete contigs.
  • Reconciles the alternative contig sequences with each other and repairs circularisation issues.
  • Performs a multiple sequence alignment (MSA) of the alternative sequences.
  • Constructs a consensus sequence from the MSA by choosing between variants where the sequences differ.
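
To make the final step more concrete, here is a toy Python sketch of building a consensus from a multiple sequence alignment. It uses a plain per-column majority vote, which is an assumption made for illustration only and should not be read as Trycycler's actual method for choosing between variants. The function name `majority_consensus` and the example sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Return a consensus from equal-length aligned sequences.

    Toy illustration only: each column is decided by a simple majority
    vote, and columns where a gap ('-') wins are dropped from the result.
    Ties go to whichever character appears first in the column.
    """
    if not aligned_seqs:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        # most_common(1) gives the single most frequent character.
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


# Hypothetical alignment of three alternative contig sequences:
# the second one carries a spurious extra T.
msa = [
    "ACGT-ACGT",
    "ACGTTACGT",
    "ACGT-ACGT",
]
print(majority_consensus(msa))  # ACGTACGT
```

The point of the sketch is that an error present in only one input assembly is outvoted by the others, which is why supplying several independent assemblies matters.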

The result is a long-read assembly you can trust!

An important caveat

Trycycler does not guarantee a perfect assembly of the underlying genome, because systematic basecalling errors can still create small-scale sequence errors. Incorrect homopolymer lengths are a common example of this problem, e.g. AAAAAAAA becoming AAAAAAA (read more here).
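
As a small illustration of what a homopolymer-length error looks like, the Python sketch below run-length encodes two sequences and checks whether they differ only in the lengths of their homopolymer runs. The helper names and the example sequences are hypothetical and are not part of Trycycler.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Run-length encode a sequence as (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def differs_only_in_homopolymer_length(a, b):
    """True if a and b have the same order of bases but different run lengths."""
    runs_a = homopolymer_runs(a)
    runs_b = homopolymer_runs(b)
    same_bases = [base for base, _ in runs_a] == [base for base, _ in runs_b]
    return same_bases and runs_a != runs_b


print(homopolymer_runs("AAAAAAAA"))  # [('A', 8)]

# Hypothetical true sequence versus an assembly that lost one A.
print(differs_only_in_homopolymer_length("GCAAAAAAAATC", "GCAAAAAAATC"))  # True
```

Because every read tends to carry the same systematic error, a consensus of long-read assemblies alone cannot be expected to correct it.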

But if all goes well when running Trycycler, small-scale errors will be the only type of error in its consensus long-read assembly. You can then polish your Trycycler assembly to repair these small-scale errors, e.g. long-read polishing with Medaka followed by short-read polishing with Polypolish. A Trycycler+Medaka+Polypolish approach can therefore yield the best possible bacterial genome assembly: Trycycler fixes the medium-to-large-scale errors while Medaka and Polypolish fix the small-scale errors.

Where to begin?

Are you new to Trycycler and interested in trying it out? If so, you'll first need to get it installed, so check out the Software requirements and Installation pages.

After that, I'd recommend that you read the documentation for How to run Trycycler and look at the Illustrated pipeline overview. Since running Trycycler often involves manual intervention and judgement calls, it is useful to have a good understanding of how it works. Don't worry – nothing about Trycycler's approach is particularly complicated!

Finally, I'd suggest that you practise using Trycycler on the provided Demo datasets. I've included my analysis of those datasets (commands, outputs and interventions) so you can attempt a Trycycler assembly and then compare your results to mine.

If you want to read more or need to cite Trycycler, here is its corresponding manuscript: Wick RR, Judd LM, Cerdeira LT, Hawkey J, Méric G, Vezina B, Wyres KL, Holt KE. Trycycler: consensus long-read assemblies for bacterial genomes. Genome Biology. 2021. doi:10.1186/s13059-021-02483-z.
