# SESSION 5

Prerequisites: in a terminal, create the `Conda` env, install the required packages (`jupyter`, `quast` and `busco`) and activate the env as follows before starting Jupyter.

**We will create a new env called curso_5**

conda create -y --name curso_5

conda install -y -n curso_5 -c bioconda jupyter quast busco

conda activate curso_5

jupyter notebook &

# Quality assessment with [QUAST](https://github.com/ablab/quast)

So far we have used `dnadiff` from the `mummer` package to evaluate our assembly against the reference genome. We obtained the average accuracy and the number of breakpoints, SNPs and indels. A similar tool for measuring the quality of an assembly is [`quast`](https://github.com/ablab/quast), which can be run with or without a reference file. It yields a summary that describes how fragmented the assembly is. This is captured by the NG50 value (N50 without a reference), which is almost always used to compare different assemblers. N50 equals the length of the smallest contig which, together with all longer contigs, covers $50\%$ of the total assembly length; NG50 is defined the same way but uses $50\%$ of the genome length. When given the reference genome, `quast` will also report the number of translocations, relocations and inversions, and the rate of mismatches and indels.
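The N50 and NG50 definitions can be illustrated with a few lines of Python. This is a minimal sketch, not how `quast` computes its report: the function name `nx` and the contig lengths are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def nx(contig_lengths, fraction=0.5, total=None):
    """Return the length of the smallest contig among the longest contigs
    that together cover `fraction` of `total`.

    With total=None the sum of the contigs is used (N50-style).
    Passing a genome size as `total` gives the NG50-style value.
    Returns None if the contigs never reach the target length.
    """
    if total is None:
        total = sum(contig_lengths)
    target = fraction * total
    running = 0
    # Walk through the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None


# Hypothetical contig lengths, for illustration only.
lengths = [400, 300, 200, 100]

n50 = nx(lengths)               # half of 1000 bp is reached at the 300 bp contig
ng50 = nx(lengths, total=1600)  # half of 1600 bp is reached at the 200 bp contig

print("N50 :", n50)
print("NG50:", ng50)
```

For an assembly made of a single contig, both values are simply the length of that contig.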


To be able to run `quast` you need a `conda` environment with the `quast` package installed.

In [1]:
!quast -h

QUAST: Quality Assessment Tool for Genome Assemblies
Version: 5.2.0

Usage: python /opt/anaconda3/envs/curso_5/bin/quast [options] <files_with_contigs>

Options:
-o  --output-dir  <dirname>       Directory to store all result files [default: quast_results/results_<datetime>]
-r                <filename>      Reference genome file
-g  --features [type:]<filename>  File with genomic feature coordinates in the reference (GFF, BED, NCBI or TXT)
                                  Optional 'type' can be specified for extracting only a specific feature type from GFF
-m  --min-contig  <int>           Lower threshold for contig length [default: 500]
-t  --threads     <int>           Maximum number of threads [default: 25% of CPUs]

Advanced options:
-s  --split-scaffolds                 Split assemblies by continuous fragments of N's and add such "contigs" to the comparison
-l  --labels "label, label, ..."      Names of assemblies to use in reports, comma-separated. If contain spa

In [2]:
!quast \
    -t 4 \
    --fast \
    --silent \
    -o bs_assembly_miniasm \
    bs_assembly_miniasm.fasta

/opt/anaconda3/envs/curso_5/bin/quast -t 4 --fast --silent -o bs_assembly_miniasm bs_assembly_miniasm.fasta


System information:
  OS: Darwin-22.2.0-x86_64-i386-64bit (macosx)
  Python version: 3.7.12
  CPUs number: 4

Started: 2023-01-26 11:00:42

Logging to /Users/guyot/FORMATION/Cours_Nanopore_Londrina/SESSION_5/bs_assembly_miniasm/quast.log

CWD: /Users/guyot/FORMATION/Cours_Nanopore_Londrina/SESSION_5
Main parameters: 
  MODE: default, threads: 4, min contig length: 500, min alignment length: 65, min alignment IDY: 95.0, \
  ambiguity: one, min local misassembly length: 200, min extensive misassembly length: 1000

Contigs:
  Pre-processing...
  bs_assembly_miniasm.fasta ==> bs_assembly_miniasm

2023-01-26 11:00:47
Running Basic statistics processor...
Done.

NOTICE: Genes are not predicted by default. Use --gene-finding or --glimmer option to enable it.

2023-01-26 11:00:47
RESULTS:
  Text versions of total report are saved to /Users/guyot/FORMATION/Cours_Nanopore_Londrina/SESSIO

In [3]:
!cat bs_assembly_miniasm/report.txt

All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).

Assembly                    bs_assembly_miniasm
# contigs (>= 0 bp)         1                  
# contigs (>= 1000 bp)      1                  
# contigs (>= 5000 bp)      1                  
# contigs (>= 10000 bp)     1                  
# contigs (>= 25000 bp)     1                  
# contigs (>= 50000 bp)     1                  
Total length (>= 0 bp)      3931083            
Total length (>= 1000 bp)   3931083            
Total length (>= 5000 bp)   3931083            
Total length (>= 10000 bp)  3931083            
Total length (>= 25000 bp)  3931083            
Total length (>= 50000 bp)  3931083            
# contigs                   1                  
Largest contig              3931083            
Total length                3931083            
N50                         3931083            
N90   

In [5]:
!quast \
    -t 4 \
    --silent \
    -o bs_quast \
    --min-identity 80.0 \
    -r data/bacillus_subtilis/bs_ref.fasta \
    bs_assembly_miniasm.fasta \
    bs_assembly_miniasm_r4.fasta \
    consensus.fasta > err 2>&1

#!cat bs_quast/report.txt

For a more detailed summary, open the `report.pdf` file in the `quast` output directory.
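If you prefer to compare the assemblies programmatically, `quast` also writes a tab-separated `report.tsv` next to `report.txt`. The sketch below assumes that layout (one header row starting with `Assembly`, then one metric per row); the helper name `parse_quast_tsv` is hypothetical, and the sample text reuses the values printed for `bs_assembly_miniasm` above.

```python
import csv
import io


def parse_quast_tsv(text):
    """Turn the text of a QUAST-style report.tsv into
    {assembly_name: {metric_name: value_as_string}}."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    assemblies = header[1:]  # the first column holds the metric names
    table = {name: {} for name in assemblies}
    for row in body:
        if not row:  # skip blank lines
            continue
        metric, values = row[0], row[1:]
        for name, value in zip(assemblies, values):
            table[name][metric] = value
    return table


# Sample text mimicking report.tsv, using the numbers shown earlier.
sample = (
    "Assembly\tbs_assembly_miniasm\n"
    "# contigs\t1\n"
    "Largest contig\t3931083\n"
    "Total length\t3931083\n"
    "N50\t3931083\n"
)

report = parse_quast_tsv(sample)
print(report["bs_assembly_miniasm"]["N50"])
```

To use it on a real run, read the file first, for example `parse_quast_tsv(open("bs_quast/report.tsv").read())`, assuming the output directory from the command above.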

When we do not have a reference genome and are doing *de novo* assembly, we still want some way to evaluate our results. For chromosome coverage and fragmentation, we can run `quast` without the reference genome and look at the N50 measure. For accuracy, we can translate our DNA into proteins and search a protein database for matches. If our assembly is full of insertions and deletions, the open reading frames will be shifted and the resulting proteins will be truncated or have no matches in the database. Luckily, there are two tools that do exactly this.
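To see why a single indel matters so much, here is a small sketch that translates a toy open reading frame before and after a one-base insertion. The sequence is invented for illustration, and `translate` is a hypothetical helper that uses the standard genetic code and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1 of `dna` and stop at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Invented toy ORF: ATG GCT AAA GGT CTG GAA TAA
correct = "ATGGCTAAAGGTCTGGAATAA"
# The same ORF with one extra base inserted after the start codon.
with_insertion = correct[:3] + "T" + correct[3:]

print(translate(correct))         # full-length toy protein
print(translate(with_insertion))  # frame is shifted and hits an early stop
```

The inserted base shifts every downstream codon, so the second translation ends at a premature stop codon after only two residues.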

# Quality assessment with [BUSCO](https://gitlab.com/ezlab/busco)

`Busco` assesses the assembly based on evolutionarily informed expectations of gene content. It searches for single copy orthologs in the determined lineage of the genome we sequenced and calculates the fractions of complete, fragmented and missing orthologs.
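BUSCO reports these fractions on a one-line summary of the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` (complete, single-copy, duplicated, fragmented, missing, total number of orthologs searched). The sketch below parses such a line; the exact format is an assumption based on BUSCO short summaries and may differ between versions, the helper name is hypothetical, and the numbers are made up, not results from this session.

```python
import re

# Assumed layout of the BUSCO one-line summary.
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_line(line):
    """Return the BUSCO percentages as floats and n as int, or None if no match."""
    match = SUMMARY_RE.search(line)
    if match is None:
        return None
    scores = {key: float(value)
              for key, value in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Made-up example line, for illustration only.
example = "C:90.0%[S:89.0%,D:1.0%],F:4.0%,M:6.0%,n:100"
scores = parse_busco_line(example)
print(scores)
```

Complete, fragmented and missing should add up to 100%, which is a quick sanity check when comparing summaries.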

On the other hand, `ideel` (indels are not ideal) is a pipeline which translates all proteins, searches for the best match of each one in a protein database and calculates the length ratio between each pair. Afterwards, it draws a histogram of the length ratios, which should have a peak at value $1$. Assemblies with many errors will have many truncated proteins, and therefore many ratios well below $1$.
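The length-ratio idea can be sketched without running the pipeline. The code below is not `ideel` itself: it assumes you already have, for each predicted protein, its length and the length of its best database hit, and all the numbers and function names are hypothetical.

```python
from collections import Counter


def length_ratios(pairs):
    """Query length divided by best-hit length for each (query_len, hit_len) pair."""
    return [query / hit for query, hit in pairs if hit > 0]


def fraction_near_one(ratios, tolerance=0.05):
    """Fraction of ratios within `tolerance` of 1.0 (roughly full-length proteins)."""
    if not ratios:
        return 0.0
    return sum(abs(r - 1.0) <= tolerance for r in ratios) / len(ratios)


def histogram(pairs, bins_per_unit=10):
    """Count pairs per ratio bin using integer arithmetic.
    With 10 bins per unit, bin 10 covers ratios from 1.0 up to (not including) 1.1."""
    return Counter((query * bins_per_unit) // hit for query, hit in pairs if hit > 0)


# Hypothetical (predicted protein length, best hit length) pairs.
pairs = [(300, 300), (298, 300), (150, 300), (90, 300)]

ratios = length_ratios(pairs)
near = fraction_near_one(ratios)
hist = histogram(pairs)

for bin_index in sorted(hist):
    print(f"{bin_index / 10:.1f}  {'#' * hist[bin_index]}")
print("fraction close to 1:", near)
```

In this toy data set two of the four proteins are close to full length, while the other two look truncated, which is the pattern expected from frameshifting indels.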

Let us first see how `busco` evaluates our `racon`-polished assembly and the `medaka`-polished assembly. You will need a `conda` environment with the `busco` package installed.

In [7]:
!busco --help

No module named 'busco'
There was a problem installing BUSCO or importing one of its dependencies. See the user guide and the GitLab issue board (https://gitlab.com/ezlab/busco/issues) if you need further assistance.


In [None]:
!busco --list-datasets

In [None]:
!busco \
    -c 4 \
    -f \
    --quiet \
    -m genome \
    -l bacillales_odb10 \
    -o bs_miniasm_busco \
    -i bs_assembly_miniasm.fasta

In [None]:
!busco \
    -c 4 \
    -f \
    --quiet \
    -m genome \
    -l bacillales_odb10 \
    -o bs_miniasm_r4_busco \
    -i bs_assembly_miniasm_r4.fasta

In [None]:
!busco \
    -c 4 \
    -f \
    --quiet \
    -m genome \
    -l bacillales_odb10 \
    -o bs_consensus_busco \
    -i consensus.fasta

In [None]:
Put all the files whose names start with "short summary" into the same directory (`my_summaries`).
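One way to gather the summaries is a short Python helper. This is a sketch under the assumption that each BUSCO output directory contains its summary as a top-level file matching `short_summary*.txt`; the function name `collect_short_summaries` is hypothetical.

```python
import shutil
from pathlib import Path


def collect_short_summaries(run_dirs, dest="my_summaries"):
    """Copy every short_summary*.txt found directly inside each run directory
    into `dest`, and return the list of copied files."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    copied = []
    for run_dir in run_dirs:
        # Assumption: the summary file sits at the top level of the run directory.
        for summary in sorted(Path(run_dir).glob("short_summary*.txt")):
            target = dest_path / summary.name
            shutil.copy(summary, target)
            copied.append(target)
    return copied


# Example call with the output directories used in this session:
# collect_short_summaries(
#     ["bs_miniasm_busco", "bs_miniasm_r4_busco", "bs_consensus_busco"]
# )
```

The files are copied rather than moved, so the original BUSCO output directories stay intact.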

In [None]:
!generate_plot.py -wd my_summaries