Ryan Wick edited this page Jan 19, 2024 · 25 revisions

Polypolish

The problem

While long-read sequencing platforms (like Oxford Nanopore) have gotten a lot better in recent years, long-read-only assemblies still suffer from some consensus sequence errors. Homopolymer-length errors are the most common type, e.g. AAAAAAAA becoming AAAAAAA. One can use short reads (like from an Illumina platform) to correct errors in a long-read assembly, a process known as short-read polishing. There are a number of short-read polishing tools, including FMLRC2, HyPo, NextPolish, ntEdit, Pilon, POLCA and Racon.

However, errors in repeat sequences can be difficult to fix. Most of those short-read polishing tools rely on alignments generated by tools like BWA-MEM. When run with default settings, aligners put each read in a single best location (randomly chosen in the case of a tie). So if the assembly has an error in a repeat, reads may not align to it because they can get a better alignment in other instances of the repeat. For example, consider a genome with a two-copy exact repeat (I'll call them copy A and copy B), where the assembly of this genome has an error in copy A. When aligning short reads to the assembly, all reads that originated from the error-containing region of copy A will instead align to the corresponding region of copy B, because they can achieve a more accurate alignment there. This leaves no reads aligned over the error, and a short-read polishing tool will therefore have no information with which to fix the error.
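The copy A / copy B scenario can be reproduced with a small toy simulation. This is only a sketch under simplifying assumptions: the sequences are random, the reads are error-free, alignment is ungapped and scored by Hamming distance (not a real aligner such as BWA-MEM), the assembly error is a substitution rather than a homopolymer-length error, and the repeat is long enough that every read covering the error lies entirely inside the repeat. All names (`best_location`, `READ_LEN`, and so on) are hypothetical.

```python
import random

READ_LEN = 20
rng = random.Random(0)


def random_dna(n):
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


# True genome: flank + repeat (copy A) + flank + repeat (copy B) + flank.
repeat = random_dna(60)
left, middle, right = random_dna(40), random_dna(40), random_dna(40)
genome = left + repeat + middle + repeat + right

# Draft assembly: identical except for one wrong base in the middle of copy A.
error_pos = len(left) + 30
wrong_base = "A" if genome[error_pos] != "A" else "C"
assembly = genome[:error_pos] + wrong_base + genome[error_pos + 1:]

# Error-free short reads: one read starting at every position of the true genome.
reads = [genome[i:i + READ_LEN] for i in range(len(genome) - READ_LEN + 1)]


def best_location(read, reference):
    """Return a single best ungapped placement, breaking ties at random."""
    distances = [hamming(read, reference[i:i + len(read)])
                 for i in range(len(reference) - len(read) + 1)]
    best = min(distances)
    return rng.choice([i for i, d in enumerate(distances) if d == best])


# Depth over the erroneous base when every read gets exactly one placement.
depth_at_error = 0
for read in reads:
    start = best_location(read, assembly)
    if start <= error_pos < start + READ_LEN:
        depth_at_error += 1

print("reads covering the error in copy A:", depth_at_error)
```

Under these assumptions the depth over the error should be zero: every read that could cover it has a perfect match in copy B, so the single-best-location rule always sends it there.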

Long-read assemblies with short-read polishing can be very accurate, and their non-repeat sequences may in fact be perfect. But due to the scenario described above, errors often remain in repeat sequences. This problem keeps truly error-free genome assemblies out of reach.

The solution

Polypolish is a short-read polishing tool that differs from existing tools in an important way: it uses short-read alignments where each read is aligned to all possible locations. This means that errors in repeats will be covered by short-read alignments, and Polypolish can therefore fix those errors. For an illustrated walk-through of how it works, check out the Toy example page of this wiki.
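To see why all-locations alignment helps, here is a toy sketch that places each read at every position where it fits and then tallies the read bases over an error in one repeat copy. It is not Polypolish's implementation: alignment is ungapped, reads are error-free, and "all possible locations" is approximated as "every placement with at most `MAX_MISMATCHES` mismatches", which is an assumption made for illustration. All names are hypothetical.

```python
import random
from collections import Counter

READ_LEN = 20
MAX_MISMATCHES = 1  # assumed cut-off for reporting a placement
rng = random.Random(0)


def random_dna(n):
    """Return a random DNA string of length n."""
    return "".join(rng.choice("ACGT") for _ in range(n))


# True genome with a two-copy exact repeat (copy A, then copy B).
repeat = random_dna(60)
left, middle, right = random_dna(40), random_dna(40), random_dna(40)
genome = left + repeat + middle + repeat + right

# Draft assembly with a single substitution error in the middle of copy A.
error_pos = len(left) + 30
true_base = genome[error_pos]
wrong_base = "A" if true_base != "A" else "C"
assembly = genome[:error_pos] + wrong_base + genome[error_pos + 1:]

# Error-free reads, one starting at every position of the true genome.
reads = [genome[i:i + READ_LEN] for i in range(len(genome) - READ_LEN + 1)]


def all_locations(read, reference, max_mismatches=MAX_MISMATCHES):
    """Return every ungapped placement with at most max_mismatches mismatches."""
    hits = []
    for i in range(len(reference) - len(read) + 1):
        window = reference[i:i + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


# Tally the read bases that land on the erroneous assembly position.
pileup = Counter()
for read in reads:
    for start in all_locations(read, assembly):
        if start <= error_pos < start + READ_LEN:
            pileup[read[error_pos - start]] += 1

print("assembly base:", wrong_base)
print("read bases over the error:", dict(pileup))
```

In this toy setting the error is now covered by reads from both repeat copies, and every one of them carries the true base, which is exactly the evidence a polisher needs.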

In addition to its ability to polish repeats, Polypolish is very conservative: it only changes a locus when the evidence is very strong, otherwise opting to make no change. This means that Polypolish is very unlikely to introduce new errors into an assembly.
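The idea of only changing a locus when the evidence is very strong can be illustrated with a simple threshold rule. The sketch below is a hypothetical rule written for this page, not Polypolish's actual logic: the function name `conservative_call` and the `min_depth`, `valid_fraction` and `invalid_fraction` values are all assumptions chosen for illustration. A base is changed only when exactly one candidate is well supported and no other candidate has even moderate support.

```python
from collections import Counter


def conservative_call(assembly_base, pileup, min_depth=5,
                      valid_fraction=0.5, invalid_fraction=0.2):
    """Return the base to keep at one position (illustrative rule only).

    pileup maps each observed read base to its count at this position.
    """
    depth = sum(pileup.values())
    if depth < min_depth:
        return assembly_base  # too little evidence: make no change

    # Candidates with strong support.
    valid = [b for b, n in pileup.items() if n >= valid_fraction * depth]
    # Candidates in the grey zone: not strong, but too common to dismiss.
    ambiguous = [b for b, n in pileup.items()
                 if invalid_fraction * depth <= n < valid_fraction * depth]

    if len(valid) == 1 and not ambiguous:
        return valid[0]  # one clear winner (may equal the assembly base)
    return assembly_base  # evidence is mixed: make no change


# Unanimous reads disagree with the assembly: change the base.
print(conservative_call("A", Counter({"G": 40})))
# Reads are split between two bases: leave the assembly alone.
print(conservative_call("A", Counter({"G": 22, "A": 18})))
# Very low depth: leave the assembly alone.
print(conservative_call("A", Counter({"G": 3})))
```

The design choice here is that every uncertain case falls through to "no change", so the rule can miss real errors but is unlikely to introduce new ones.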

Some caveats

No polishing tool is perfect, Polypolish included. This means that you should make your long-read assembly as accurate as possible before doing any short-read polishing. Trycycler can deliver clean assemblies which are free from medium-to-large scale errors.

Since different short-read polishers use different algorithms, I have found that using a combination of tools can deliver the most accurate assemblies. Aside from Polypolish, my favourite polisher is pypolca – try using it in addition to Polypolish.

Nothing about Polypolish is intrinsically specific to bacterial genomes – its approach should work on eukaryotes too. However, I've only ever used it on bacterial genomes and small eukaryote genomes. Large repeat-rich eukaryote genomes might cause issues (see this question in the FAQ), so try at your own risk!

Where to begin?

Check out the Software requirements and Installation pages to get Polypolish up and running. Then the How to run Polypolish page will show you how to use it.

If you want to read more or need to cite Polypolish, here is its corresponding manuscript: Wick RR, Holt KE. Polypolish: short-read polishing of long-read bacterial genome assemblies. PLOS Computational Biology. 2022. doi:10.1371/journal.pcbi.1009802.