# SESSION_4

Prerequisites: in a terminal, create the `conda` environment, install the required packages into it, and activate it as follows before starting Jupyter.

**We will create a new env called curso_4**

!conda create -y --name curso_4

!conda install -y -n curso_4 -c bioconda -c conda-forge mummer gepard racon nanopolish minimap2 jupyter

**For macOS users: you may have problems with mummer v.3 installed using conda**. If you run into trouble, install mummer4 using a regular installation; see https://mummer4.github.io/install/install.html

!conda activate curso_4

!jupyter notebook &

# Polishing Assemblies

What is “genome polishing”?

“Genome polishing,” sometimes referred to as “genome finishing,” is a workflow in which assembly software searches for local misassemblies and other inconsistencies in a draft genome assembly and then corrects them. Genome polishing can be used to create hybrid assemblies with Illumina data and long read sequencing data and is especially valuable for enhancing assembly results where there are concerns about single molecule or nanopore sequencing accuracy. (https://www.dnastar.com/blog/dnastar-news/genome-polishing-benchmarks/)

In [None]:
from IPython.display import YouTubeVideo

YouTubeVideo('jCVkHq9dlGs', 560, 315)

# CALCULATING DIFFERENCES BETWEEN ASSEMBLIES

In the previous session we assembled the B. subtilis genome without error correction of the reads. The accuracy of such an assembly roughly equals the base accuracy of the sequenced reads, which makes it unusable for many downstream analyses. 
Let us check the actual error with `dnadiff`.

`dnadiff`, from the `mummer` package, calculates the differences between two genomes and provides a detailed summary. Try it with the different assemblies made in session_3.  
(see http://mummer.sourceforge.net)

In [1]:
!dnadiff -h


  USAGE: dnadiff  [options]  <reference>  <query>
    or   dnadiff  [options]  -d <delta file>

  DESCRIPTION:
    Run comparative analysis of two sequence sets using nucmer and its
    associated utilities with recommended parameters. See MUMmer
    documentation for a more detailed description of the
    output. Produces the following output files:

    .report  - Summary of alignments, differences and SNPs
    .delta   - Standard nucmer alignment output
    .1delta  - 1-to-1 alignment from delta-filter -1
    .mdelta  - M-to-M alignment from delta-filter -m
    .1coords - 1-to-1 coordinates from show-coords -THrcl .1delta
    .mcoords - M-to-M coordinates from show-coords -THrcl .mdelta
    .snps    - SNPs from show-snps -rlTHC .1delta
    .rdiff   - Classified ref breakpoints from show-diff -rH .mdelta
    .qdiff   - Classified qry breakpoints from show-diff -qH .mdelta
    .unref   - Unaligned reference IDs and lengths (if applicable)
    .unqry   - Unaligned 

In [None]:
!dnadiff -p bs_assembly_miniasm data/bacillus_subtilis/bs_ref.fasta data/bacillus_subtilis/bs_assembly_miniasm.fasta

In [None]:
!cat bs_assembly_miniasm.report
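
To compare several assemblies it can help to pull the identity value out of the report instead of reading it by eye. The following is a minimal sketch, assuming the `.report` file contains a line starting with `AvgIdentity` followed by two numeric columns (reference and query), and that the first such line belongs to the 1-to-1 alignments. The function name `parse_avg_identity` and the example text are hypothetical; the numbers in the example are made up.

```python
def parse_avg_identity(report_text):
    """Return (ref, qry) from the first 'AvgIdentity' line, or None if absent."""
    for line in report_text.splitlines():
        fields = line.split()
        # Assumed layout: label, reference value, query value
        if len(fields) == 3 and fields[0] == "AvgIdentity":
            return float(fields[1]), float(fields[2])
    return None


# Made-up excerpt in the assumed layout; the values are illustrative only.
example_report = """\
[Alignments]
1-to-1                          10                   10
AvgLength                  1000.00              1000.00
AvgIdentity                  90.00                90.50
"""

print(parse_avg_identity(example_report))
```

On a real run, the text would come from the file written by `dnadiff`, for example `parse_avg_identity(open("bs_assembly_miniasm.report").read())`.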

When we mapped all our reads to the reference genome in session 2, we saw that numerous reads cover each base in the reference. 
This information can be used to amend errors that happened during sequencing and basecalling by aligning all the reads to our assembly. The general idea is to create an alignment pileup of all reads, from which we can infer the most frequent base at each position in the assembly.
There are different polishing tools for correcting raw contigs; they take nanopore reads or Illumina reads as input to perform the corrections.
Popular polishing tools include racon (https://github.com/isovic/racon), medaka (https://github.com/nanoporetech/medaka), and nanopolish (https://github.com/jts/nanopolish).
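
The "most frequent base at each position" idea can be illustrated with a toy majority vote. This is only a sketch under strong assumptions: the reads are already aligned to the same coordinates, have equal length, and use `-` for gaps. It is not how racon, medaka or nanopolish compute their consensus. The function name `pileup_consensus` and the toy reads are made up for illustration.

```python
from collections import Counter


def pileup_consensus(aligned_reads, gap="-"):
    """Majority-vote consensus over equal-length, pre-aligned read strings."""
    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this pileup column
        base, _count = Counter(column).most_common(1)[0]
        if base != gap:  # a majority gap means no base is emitted here
            consensus.append(base)
    return "".join(consensus)


# Toy pileup: each read carries at most one error relative to the others.
toy_pileup = [
    "ACGTTGCA",
    "ACGTAGCA",
    "ACCTTGCA",
    "ACGTTG-A",
    "ACGTTGCA",
]
print(pileup_consensus(toy_pileup))
```

Because each error occurs in only one read per column, the vote recovers the shared sequence; with low coverage or systematic errors a simple vote like this would fail, which is why real polishers use more elaborate models.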

For a fixed set of parameters, `racon` will ultimately hit the maximal accuracy value after a couple of iterations. To increase accuracy even further, we need to use a different algorithm or even additional information. We have two options, use `medaka` from Oxford Nanopore Technologies which uses deep neural networks trained on `racon` output using only basecalled reads, or use `nanopolish` which uses the raw signal data to increase the accuracy with hidden Markov models. First we will try `medaka` which is much faster than `nanopolish`, and afterwards we will showcase how to use `nanopolish`.

# POLISHING WITH RACON

`racon` was developed atop the `minimap` and `miniasm` assembly pipeline as a consensus module for third generation sequencing data. The core engine of `racon` is a partial order alignment library called `spoa`.  
(see https://github.com/isovic/racon)

`racon` was designed to iteratively increase the accuracy of a target sequence by first using a mapper (`minimap2`) to map/align all the reads to the target. Afterwards, it filters out low quality overlaps, slices the target sequence into windows of 500 bp, drops read parts that do not pass a quality threshold and constructs a multiple sequence alignment for each window. After calling the consensus on each window, the final sequence is obtained by concatenating all window consensuses.  

(see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411768/)
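
The window step described above can be sketched in a few lines. This is an illustration of the slicing and concatenation only, assuming a placeholder `polish_window` callable stands in for the per-window consensus (in `racon` that role is played by partial order alignment with `spoa`). The function names are hypothetical and are not part of `racon`.

```python
def split_into_windows(sequence, window_length=500):
    """Slice a sequence into consecutive windows; the last one may be shorter."""
    return [
        sequence[start:start + window_length]
        for start in range(0, len(sequence), window_length)
    ]


def polish_by_window(sequence, polish_window, window_length=500):
    """Apply a per-window polishing function and concatenate the results."""
    windows = split_into_windows(sequence, window_length)
    return "".join(polish_window(window) for window in windows)


toy_target = "ACGT" * 300  # 1200 bp toy sequence
print([len(w) for w in split_into_windows(toy_target)])

# With an identity "polisher" the concatenation gives back the input unchanged.
print(polish_by_window(toy_target, lambda window: window) == toy_target)
```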

We are going to use racon to do an initial correction. The medaka documentation advises doing four rounds with racon before polishing with medaka, since medaka has been trained with racon-polished assemblies. 

(see https://denbi-nanopore-training-course.readthedocs.io/en/latest/polishing/index.html)


In [2]:
from IPython.display import Image
Image(url ="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5411768/bin/737f03.jpg")

In [None]:
!racon

Let us run first `minimap2` to find the positions where our reads map to the assembly and then we will use `racon` to increase the accuracy. Eventually running `dnadiff` will give us the details about the differences against the reference genome.  
**Do 4 rounds of Racon**.

**Round 1**

In [None]:
!minimap2 \
    -t 4 \
    -x map-ont \
    bs_assembly_miniasm.fasta \
    data/bacillus_subtilis/bs_reads.fastq.gz > bs_assembly_miniasm.paf

!racon \
    -t 4 \
    data/bacillus_subtilis/bs_reads.fastq.gz \
    bs_assembly_miniasm.paf \
    bs_assembly_miniasm.fasta > bs_assembly_miniasm_r1.fasta

In [None]:
!dnadiff \
    -p bs_assembly_miniasm_r1 \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_assembly_miniasm_r1.fasta 2> err

In [None]:
!cat bs_assembly_miniasm_r1.report

With only one `racon` iteration the accuracy increased from $85.28\%$ to $98.76\%$. In addition, the number of break points decreased drastically, the assembly length almost matches the actual reference length, and almost no unalignable bases are left.
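
To put the two percentages in perspective, they can be converted to error rates. The sketch below treats the identity values quoted above as accuracy and uses the standard Phred scaling $Q = -10\log_{10}(p)$, where $p$ is the error probability. The function names are chosen here for illustration.

```python
import math


def error_rate(identity_percent):
    """Fraction of mismatching bases implied by a percent identity."""
    return 1.0 - identity_percent / 100.0


def phred_quality(identity_percent):
    """Phred-scaled quality: Q = -10 * log10(error probability)."""
    return -10.0 * math.log10(error_rate(identity_percent))


before, after = 85.28, 98.76  # identities quoted in the text above
fold = error_rate(before) / error_rate(after)

print(f"errors per 100 bases: {100 * error_rate(before):.2f} -> {100 * error_rate(after):.2f}")
print(f"fold reduction in errors: {fold:.1f}")
print(f"Phred-scaled quality: Q{phred_quality(before):.1f} -> Q{phred_quality(after):.1f}")
```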

We will run a few more iterations to see how far we can increase the accuracy.  
Do several iterations of `Racon`.

**Round 2**

In [None]:
!minimap2 \
    -t 4 \
    -x map-ont \
    bs_assembly_miniasm_r1.fasta \
    data/bacillus_subtilis/bs_reads.fastq.gz > bs_assembly_miniasm_r1.paf

!racon \
    -t 4 \
    data/bacillus_subtilis/bs_reads.fastq.gz \
    bs_assembly_miniasm_r1.paf \
    bs_assembly_miniasm_r1.fasta > bs_assembly_miniasm_r2.fasta

In [None]:
!dnadiff \
    -p bs_assembly_miniasm_r2 \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_assembly_miniasm_r2.fasta 2> err

!cat bs_assembly_miniasm_r2.report

**Round 3**

In [None]:
!minimap2 \
    -t 4 \
    -x map-ont \
    bs_assembly_miniasm_r2.fasta \
    data/bacillus_subtilis/bs_reads.fastq.gz > bs_assembly_miniasm_r2.paf

!racon \
    -t 4 \
    data/bacillus_subtilis/bs_reads.fastq.gz \
    bs_assembly_miniasm_r2.paf \
    bs_assembly_miniasm_r2.fasta > bs_assembly_miniasm_r3.fasta

In [None]:
!dnadiff \
    -p bs_assembly_miniasm_r3 \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_assembly_miniasm_r3.fasta 2> err

!cat bs_assembly_miniasm_r3.report

**Round 4**

In [None]:
!minimap2 \
    -t 4 \
    -x map-ont \
    bs_assembly_miniasm_r3.fasta \
    data/bacillus_subtilis/bs_reads.fastq.gz > bs_assembly_miniasm_r3.paf

!racon \
    -t 4 \
    data/bacillus_subtilis/bs_reads.fastq.gz \
    bs_assembly_miniasm_r3.paf \
    bs_assembly_miniasm_r3.fasta > bs_assembly_miniasm_r4.fasta

In [None]:
!dnadiff \
    -p bs_assembly_miniasm_r4 \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_assembly_miniasm_r4.fasta 2> err

!cat bs_assembly_miniasm_r4.report
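
The four rounds above repeat the same two commands with different file names, so they can also be written as a loop. The following is a sketch, not a tested pipeline: it reuses only the `minimap2` and `racon` options shown in the cells above and the same file naming scheme, and the helper names `racon_round_commands` and `run_plan` are hypothetical. Building the plan does not require the tools to be installed; `run_plan` does.

```python
import subprocess


def racon_round_commands(draft, reads, rounds=4, threads=4,
                         prefix="bs_assembly_miniasm"):
    """Return one (minimap2_cmd, paf, racon_cmd, polished) tuple per round."""
    plan = []
    current = draft
    for i in range(1, rounds + 1):
        # Same naming as the cells above: <prefix>.paf, then <prefix>_r1.paf, ...
        paf = f"{prefix}.paf" if i == 1 else f"{prefix}_r{i - 1}.paf"
        polished = f"{prefix}_r{i}.fasta"
        minimap2_cmd = ["minimap2", "-t", str(threads), "-x", "map-ont",
                        current, reads]
        racon_cmd = ["racon", "-t", str(threads), reads, paf, current]
        plan.append((minimap2_cmd, paf, racon_cmd, polished))
        current = polished  # the next round polishes this round's output
    return plan


def run_plan(plan):
    """Execute each round, redirecting stdout to the PAF and FASTA files."""
    for minimap2_cmd, paf, racon_cmd, polished in plan:
        with open(paf, "w") as out:
            subprocess.run(minimap2_cmd, stdout=out, check=True)
        with open(polished, "w") as out:
            subprocess.run(racon_cmd, stdout=out, check=True)


plan = racon_round_commands("bs_assembly_miniasm.fasta",
                            "data/bacillus_subtilis/bs_reads.fastq.gz")
for minimap2_cmd, paf, racon_cmd, polished in plan:
    print(" ".join(minimap2_cmd), ">", paf)
    print(" ".join(racon_cmd), ">", polished)
```

Calling `run_plan(plan)` would then perform the four rounds; printing the plan first makes it easy to check the file names before anything is executed.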

# POLISHING WITH MEDAKA

`medaka` is a tool to create consensus sequences and variant calls from nanopore sequencing data. This task is performed using neural networks applied to a pileup of individual sequencing reads against a draft assembly. It provides state-of-the-art results outperforming sequence-graph based methods and signal-based methods, whilst also being faster.

(see https://github.com/nanoporetech/medaka)

Medaka is not compatible with the curso_4 env. **A new environment is necessary**. In a new terminal, type:

!conda create -y --name medaka python=3.6 

!conda install -y -n medaka medaka jupyter

!conda activate medaka

!jupyter notebook &

!medaka_consensus -h

In [None]:
!medaka_consensus \
    -i data/bacillus_subtilis/bs_reads.fastq.gz \
    -d bs_assembly_miniasm_r4.fasta \
    -m r10_min_high_g340 \
    -t 4 \
    -o bs_miniasm_medaka \
    -f

In [None]:
%ls bs_miniasm_medaka

In [None]:
!dnadiff \
    -p bs_miniasm_medaka \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_miniasm_medaka/consensus.fasta 2> err

!cat bs_miniasm_medaka.report

Let us try running `medaka` directly on the unpolished assembly.

In [None]:
!medaka_consensus \
    -i data/bacillus_subtilis/bs_reads.fastq.gz \
    -d data/bacillus_subtilis/bs_assembly_miniasm.fasta \
    -m r10_min_high_g340 \
    -t 4 \
    -o bs_raw_miniasm_medaka \
    -f

In [None]:
!dnadiff \
    -p bs_raw_miniasm_medaka \
    data/bacillus_subtilis/bs_ref.fasta \
    bs_raw_miniasm_medaka/consensus.fasta 2> err

!cat bs_raw_miniasm_medaka.report

# POLISHING WITH NANOPOLISH

Because `nanopolish` takes a long time to run, we will not run it here.  
(see https://github.com/jts/nanopolish).  
`nanopolish` is a software package for signal-level analysis of Oxford Nanopore sequencing data. It can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome, and more.

In [None]:
!nanopolish --help

First we need to let `nanopolish` connect the raw data (fast5) and the basecalled data. 
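
This linking step is done with the `nanopolish index` subcommand, which takes the directory of raw signal files via `-d` and the basecalled reads as its argument. The sketch below only builds and prints the command. The fast5 directory shown is a hypothetical path, since the location of the raw data is not given in this session, and the helper name is made up.

```python
def nanopolish_index_command(fast5_dir, reads):
    """Build the command that links basecalled reads to their raw fast5 files."""
    return ["nanopolish", "index", "-d", fast5_dir, reads]


cmd = nanopolish_index_command(
    "data/bacillus_subtilis/fast5",  # hypothetical location of the raw fast5 files
    "data/bacillus_subtilis/bs_reads.fastq.gz",
)
print(" ".join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)` in an environment where `nanopolish` is installed.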

# OTHER TOOLS

**[Pilon](https://github.com/broadinstitute/pilon)** polishes assemblies **with Illumina data**.  
(see https://denbi-nanopore-training-course.readthedocs.io/en/latest/polishing/pilon/index.html).

# Comparative evaluation of Nanopore polishing tools

See https://doi.org/10.1038/s41598-021-00178-w 

In [None]:
from IPython.display import Image
Image(url = "https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41598-021-00178-w/MediaObjects/41598_2021_178_Fig1_HTML.png?as=webp")

______________________________________________________________________________________

**RESULTS**

Complete the following results table :

|Assembly Tools| Raw reads | Corrected reads| Assembly File name | Polishing tools | Polished File name | 
|---|---|---|---|---|---|
| Minimap/Miniasm| x| | | | |
| Flye| x| | | | |
| Flye/Canu| | x| | | |
| Raven| x| | | | |
| Raven/Canu| | x| | | |
| Canu| | x| | | |
| Shasta| x| | | | |
| Shasta/Canu| | x| | | |
| | | | | | |
| | | | | |  |