Skip to content

Automated assembly

Ryan Wick edited this page Oct 10, 2022 · 8 revisions

This tutorial assumes that you are aiming for the best possible bacterial genome assembly, and that means some human judgment and intervention is needed. This is particularly true for the Trycycler assembly step and Checking differences in IGV.

However, in many contexts, an automated assembly workflow is needed, and this page gives my recommendations. This automated workflow differs from the main tutorial by using Flye instead of Trycycler for long-read assembly and limiting short-read polishing to the tools with the lowest rates of introduced errors (Polypolish and POLCA).

While this assembly method can often work very well, you should not assume the results are error-free. In particular, structural assembly errors (e.g. fragmented replicons, doubled plasmids) are possible, as these are what Trycycler aims to avoid.

Requirements

Set variables

The commands below use these Bash variables. Set them as appropriate for your data/genome/system:

i1=reads/S_aureus_JKD6159_Illumina_1.fastq.gz
i2=reads/S_aureus_JKD6159_Illumina_2.fastq.gz
ont=reads/S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz
medaka_model=r104_e81_sup_g610
genome_size=3000000
threads=16

Read QC

mkdir reads_qc
fastp --in1 "$i1" --in2 "$i2" --out1 reads_qc/illumina_1.fastq.gz --out2 reads_qc/illumina_2.fastq.gz
mv fastp.* reads_qc
filtlong --target_bases "$genome_size"00 "$ont" > reads_qc/ont.fastq   # aim for 100x depth

Flye assembly

flye --nano-hq reads_qc/ont.fastq --threads "$threads" --out-dir flye
rotate_circular_gfa.py flye/assembly_graph.gfa > flye.fasta
rm -r flye  # clean up

The rotate_circular_gfa.py script converts the Flye GFA file to FASTA format, rotating any circular contigs by half their length. This serves to move any missing/duplicated bases at the start/end of contigs into the middle of the sequence where polishing can fix the error.

Medaka polishing

medaka_consensus -i reads_qc/ont.fastq -d flye.fasta -o medaka -m "$medaka_model"
seqtk seq medaka/consensus.fasta > flye_medaka.fasta
rm -r medaka *.fai *.mmi  # clean up

Polypolish polishing

bwa index flye_medaka.fasta
bwa mem -t "$threads" -a flye_medaka.fasta reads_qc/illumina_1.fastq.gz > alignments_1.sam
bwa mem -t "$threads" -a flye_medaka.fasta reads_qc/illumina_2.fastq.gz > alignments_2.sam
polypolish_insert_filter.py --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish flye_medaka.fasta filtered_1.sam filtered_2.sam > flye_medaka_polypolish.fasta
rm *.bwt *.pac *.ann *.amb *.sa *.sam  # clean up

POLCA polishing

polca.sh -a flye_medaka_polypolish.fasta -r reads_qc/illumina_1.fastq.gz" "reads_qc/illumina_2.fastq.gz -t "$threads"
seqtk seq flye_medaka_polypolish.fasta.PolcaCorrected.fa | paste - - | sort | tr '\t' '\n' > flye_medaka_polypolish_polca.fasta  # restore contig order
rm *PolcaCorrected.fa *.amb *.ann *.bai *.bam *.batches *.bwt *.fai *.names *.pac *.report *.sa *.sam *.success *.vcf *.err  # clean up

The resulting flye_medaka_polypolish_polca.fasta file contains the final assembly.