# "Molecular Microbial Ecology" MBGW2.2.3

## A look into the theory behind (meta)genome assembly
Metagenome datasets usually originate from the sequencing of environmental genomic DNA, which means that the sequenced DNA fragments originate from thousands to millions of taxonomically diverse microbes. For metagenome assembly this sounds like a problem, and yes, it is one. High-throughput sequencing platforms generate between hundreds of thousands and billions of sequence reads. How do we stitch them together?

The groundwork for answering this question was laid quite a while ago by the mathematician Leonhard Euler, whose work on a puzzle from the city of Königsberg founded graph theory. The Königsberg problem refers to the seven bridges of Königsberg. In the 18<sup>th</sup> century the residents of Königsberg wondered whether every part of the city could be visited by crossing each bridge exactly once and returning to one's starting point. Euler's idea was as simple as it was brilliant. He imagined every landmass of the city as a node and every bridge as an edge, resulting in a network of nodes connected by edges. The original question was thus simplified to finding a path in the network that visits every edge exactly once.

<img src="img/konigsberg.png" alt="The bridges of Königsberg problem" width="750"/>

<font size="2"> The bridges of Königsberg problem, [_Compeau et al., 2011_](https://www.nature.com/articles/nbt.2023) </font>

This idea was taken further by Nicolaas de Bruijn, after whom de Bruijn graphs are named. He studied cyclic sequences of letters taken from a given alphabet in which every possible word of a certain length (k) appears as a string of consecutive characters exactly once, and he used a graph of overlapping words to construct them. Transferring this idea to metagenome sequencing means the following: we split our sequenced metagenome reads into k-mers of length k and arrange them in a graph where nodes represent (k-1)-mers and edges represent our k-mers. Reconstructing a genome then amounts to finding a path through this graph that visits every edge once, which is exactly the kind of question Euler studied. From a computational perspective this is still demanding, but it is a solvable problem. Most currently used (meta)genome assemblers are based on de Bruijn graphs, and while the first assemblers had high computational requirements, primarily RAM for storing intermediate de Bruijn graphs, recent assemblers are able to run on almost standard, everyday hardware.

<img src="img/seq.png" alt="K-mers and (K-1)-mers" width="500"/>
<img src="img/debruijn.png" alt="De Bruijn graphs" width="750"/>
<font size="2"> Example, de Bruijn graph-based assembly. </font>
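
The splitting of reads into k-mers described above can be sketched in a few lines of shell. This is a toy illustration only, not how an assembler works internally: the helper name `kmer_edges` and the example read are made up, and a real assembler also has to deal with reverse complements, sequencing errors, and coverage. Each k-mer becomes one edge that connects its prefix (k-1)-mer with its suffix (k-1)-mer.

```shell
# Hypothetical helper: print the de Bruijn edges of a single read.
# Usage: kmer_edges <read> <k>
# Every k-mer is one edge: prefix (k-1)-mer -> suffix (k-1)-mer [k-mer]
kmer_edges() {
  awk -v s="$1" -v k="$2" 'BEGIN {
    # slide a window of width k over the read
    for (i = 1; i + k - 1 <= length(s); i++) {
      kmer = substr(s, i, k)
      printf "%s -> %s [%s]\n", substr(kmer, 1, k - 1), substr(kmer, 2, k - 1), kmer
    }
  }'
}

# Toy read, k = 3
kmer_edges ATGGCGT 3
```

For the read `ATGGCGT` and k = 3 this lists five edges, starting with `AT -> TG [ATG]`. Reads that overlap share nodes, and that is how the graph stitches them together.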

## Taxonomic assignment of metagenomic sequence reads
Before we assemble our sequence data we want to get a first glimpse into the taxonomic affiliation of our sequence reads. Many tools are available for this purpose, and a lot of them make use of k-mer or substring matching. For our datasets we will use `kaiju` with its available [webserver](http://kaiju.binf.ku.dk/server).

<img src="img/kraken.png" alt="Kraken" width="750"/>

<font size="2"> The k-mer matching based algorithm underlying Kraken, [_Wood and Salzberg, 2014_](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46) </font>

<img src="img/kaiju.png" alt="Kaiju" width="750"/>

<font size="2"> The substring matching approach of Kaiju, [_Menzel et al., 2016_](https://www.nature.com/articles/ncomms11257) </font>

## Session 02 | Initial taxonomic assignment and metagenome assembly

### Quick taxonomic analysis with kaiju

Let's do this quick and dirty taxonomic analysis.

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>Download forward and reverse reads and upload them to the Kaiju webserver. Execute the analysis with default settings; the webserver will run for approx. 30 min.</li>
  <li>What is your impression of the taxonomic diversity within our dataset?</li>
  <li>How big is the proportion of taxonomically assigned sequences?</li>
  <li>What do you think are typical assignment rates when you deal with metagenome datasets?</li>  
</ul>
</div>

<img src="img/kaiju_mask.png" alt="Kaiju UI" width="500"/>

<font size="2"> _Kaiju's_ user interface / job submission form. </font>

### A quick word about conda

Throughout the course we use many different software tools. A lot of these tools have conflicting dependencies (tool X and tool Y both rely on package Z, but tool Y needs version 0.2 of Z, while tool X needs version 0.1). In order to have them all available anyway, we use `virtual environments` (and `containers`) a lot in bioinformatics.

`conda` is a package manager and a tool to set up and manage `virtual environments`. `Virtual environments` can best be imagined as encapsulations that are, to a certain extent, isolated from the respective computer system/operating system.

By default, every user on our servers can make use of a global `conda` installation.

One can check available `conda` environments as follows:

`conda info --envs`

Environments are activated and deactivated using the following commands:

`conda activate <name_of_environment>`

`conda deactivate`

### Metagenome co-assembly with MEGAHIT and METASPADES

Having some first taxonomic information available, we proceed to metagenome assembly. We will use two different assemblers, [`MEGAHIT`](https://github.com/voutcn/megahit) and [`SPADES`](https://cab.spbu.ru/software/spades/). `MEGAHIT` is a very fast, memory-efficient short-read assembler, while `SPADES` is commonly considered the "best" short-read assembler available.

In [16]:
### We switch to our workshop folder and activate a needed conda environment
# cd ~/data/MBGW223
# conda activate assemblers
### We first use MEGAHIT to co-assemble our two datasets ...
# megahit -1 ./00_READS/dataset_A_1.fastq.gz,./00_READS/dataset_B_1.fastq.gz  -2 ./00_READS/dataset_A_2.fastq.gz,./00_READS/dataset_B_2.fastq.gz -m 0.05 -t 8 -o ./02_ASSEMBLY_BINNING/assembly_megahit
### ... and then SPADES in metagenome mode
# cat ./00_READS/dataset_*_1.fastq.gz > ./00_READS/merged_1.fastq.gz
# cat ./00_READS/dataset_*_2.fastq.gz > ./00_READS/merged_2.fastq.gz
# spades.py -t 8 --only-assembler -o ./02_ASSEMBLY_BINNING/assembly_spades --meta -1 ./00_READS/merged_1.fastq.gz -2 ./00_READS/merged_2.fastq.gz 


>NODE_1_length_925458_cov_9.567712
ATTGAATAAAAACTAATATTTCTGAAGATCAAAAGGTGAGCTTAGAAATGAAAATGAATT
TTATTTATTCCCTTTTAACGATAGTTCCTAACCAGAAGGCTTTTGACTGGAATTGAGAGC
TCGTCGATGGCCAATTTTCAGTAGTGAAATTATTTAGCGTGAAGGAAAAATCTTGAGCTG
TAAACCAGCTCAAATCGGGAATATGTTTAAATGAATTTGTATTTTCAATTTTAGATGTAC
CCCTACCATTGGACACGCTTGAATCGATTTCATTGCAACAATGGGGGCTTTTAACAACTT
CTTTATCTTCGAAATCTTCCGTTTCATGATCCTGTACATCTATAGAAGAGGTGTGTGCAT
TAGCAGTATGAACAAAACTATGGGATTTAAAGCAAGTATTACCTGCTAGGGAAAAACCAA
CAATGAGAACGAAAATTATATGAAGAAGTTTTTTCACAATTTTTATAGTACTTGGCCGGT
ACCTTAATGGGAAAGCTGTAATTTGTCAAAATACTATTTTAGTGCCCCATCATTGAGTGG


A quick word about the used parameters:

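`--presets`: MEGAHIT comes with different k-mer presets dependent on the complexity of the metagenome (not set in the command above, so MEGAHIT uses its default k-mer list)

`-m`: Memory MEGAHIT may use for building the graph; values between 0 and 1 are read as a fraction of the server's total memory (here 5 %)

`-t`: We specify eight threads for the assembly. Time is money, but we also do not want to block all our server capacity

`-1 / -2`: Forward and reverse reads to assemble

`--meta`: Similar to _MEGAHIT_, _SPADES_ comes with presets, --meta (surprise) is the preset for metagenomes

`-o`: Output folders

`--only-assembler`: SPADES usually starts with a read error correction step based on the provided .fastq files; with this option we skip it and run only the assembly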

Which assembler did a better job? How can we assess this, and which parameters/variables should we look at? We use [`QUAST`](https://github.com/ablab/quast) to check out our assemblies.

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>Use QUAST to get some basic statistics about the assembled sets of contigs.</li>
  <li>What parameters are calculated, how can they help you judge the performance of the assemblers?</li>
</ul>
</div>

In [None]:
### We switch the conda environment and use quast.py to check the assemblies
cd ~/data/MBGW223
conda deactivate
conda activate metawrap
### We check first the MEGAHIT assembly and then the SPADES assembly
quast.py -o ./02_ASSEMBLY_BINNING/assembly_megahit/quast ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
quast.py -o ./02_ASSEMBLY_BINNING/assembly_spades/quast ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta

Used parameters:

`-o`: we define an output folder, where `QUAST` should store its reports.
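
One of the statistics `QUAST` reports is N50: going from the longest contig downwards, it is the length of the contig at which half of the total assembly length is reached. The following is a minimal sketch of that calculation, assuming a plain (uncompressed) FASTA on standard input. The function name `n50` and the toy contigs are made up for illustration.

```shell
# Hypothetical helper: compute the N50 of a FASTA file read from standard input.
n50() {
  # 1) one length per contig (sequences may span several lines)
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' |
  # 2) longest contig first
  sort -rn |
  # 3) walk down the list until half of the total length is covered
  awk '{ lens[NR] = $1; total += $1 }
       END {
         for (i = 1; i <= NR; i++) {
           cum += lens[i]
           if (cum >= total / 2) { print lens[i]; exit }
         }
       }'
}

# Toy assembly with contig lengths 6, 4, 3, 2 and 1 (total 16) -> prints 4
printf '>c1\nAAA\nAAA\n>c2\nCCCC\n>c3\nGGG\n>c4\nTT\n>c5\nA\n' | n50
```

To try it on one of our assemblies you could run something like `n50 < ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa`. `QUAST` ignores contigs below a minimum length by default, so its numbers can differ from this sketch.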

Once the assembly is done, one important question is: "How many genomes are we dealing with in our assembly?". In order to answer this question we make use of [`anvio`](http://merenlab.org/software/anvio/). _anvio_ is "an open-source, community-driven analysis and visualization platform for microbial ‘omics. It brings together many aspects of today’s cutting-edge strategies including genomics, metagenomics, metatranscriptomics, pangenomics, metapangenomics, phylogenomics, and microbial population genetics in an integrated and easy-to-use fashion through extensive interactive visualization capabilities". A few years ago we had the chance to host a workshop together with the main developer of anvio, which was a great experience. `anvio` is a lighthouse example of a community effort to make cutting-edge methods available to a broad group of (non-nerdy) users.

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
How many genomes can we expect from the assembly?
</div>

In [18]:
### We use anvio to estimate the number of genomes in our two assemblies
### anvio expects that our contig names are formatted in a consistent manner
### How do our two .fasta files with assembled contigs look?!
cd ~/data/MBGW223
conda deactivate
conda activate anvio7
head ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
head ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta
### In both cases, the header includes information about the contig assembly,
### like contig length etc.
### Let's reformat them
anvi-script-reformat-fasta --simplify-names -o ./02_ASSEMBLY_BINNING/assembly_megahit/contigs_renamed.fa ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
anvi-script-reformat-fasta --simplify-names -o ./02_ASSEMBLY_BINNING/assembly_spades/contigs_renamed.fa ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta
### We generate so called contig databases (explained below!), ...
anvi-gen-contigs-database -f ./02_ASSEMBLY_BINNING/assembly_megahit/contigs_renamed.fa -n mock_metagenome_megahit -o ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db -T 6
anvi-gen-contigs-database -f ./02_ASSEMBLY_BINNING/assembly_spades/contigs_renamed.fa -n mock_metagenome_spades -o ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db -T 6
### ... identify certain genes, add this information (more about that in a bit) ...,
anvi-run-hmms -c ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db -I Bacteria_71,Archaea_76,Ribosomal_RNA_16S -T 6
anvi-run-hmms -c ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db -I Bacteria_71,Archaea_76,Ribosomal_RNA_16S -T 6
### ... and summarize them!
anvi-display-contigs-stats --report-as-text -o ./02_ASSEMBLY_BINNING/stats_mock_metagenome_megahit.txt ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db
anvi-display-contigs-stats --report-as-text -o ./02_ASSEMBLY_BINNING/stats_mock_metagenome_spades.txt ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db

>k141_8 flag=0 multi=117.0295 len=446
CCAGTCACCGTCGAATAGAACGGCACCGACGACGACCGAGGACTGATCCCGGCAAGCTCCCTCGCCAACTCGCCCTCGATCGCCTCCACATGAGCGCAATGCGAGGCGTAATCCACCGGCACACGCCGCGCCCGGACCCCCTCACCCTCACACACGGCCAGCAACTCCTCCAACGCATCCGCATCACCGGAAACCACCACCGCCAACGGCCCGTTCACCGCCGCCACCGACAACCGCTCACCCCACGCCCCCAGCCACTCCCGCACCTCCCCCACCGGCAACGACACCGACACCATCCCCCCACGCCCCGACAACGCCCGCAACGCCCTACTCCGCAACGCCACCACCCGCGCCGCATCCTCCAACGACAACCCACCCGCCACACACGCCGCCGCAATCTCACCCTGCGAATGACCCACCACCGCAGCAGGCTCCACCCCATACGA
>k141_9 flag=1 multi=36.0997 len=502
AGGTAATCTGACCATAGCCATGGCTAAAACTACGTAGTTCCAACATTGTATCACATCCTAGCCATAGGTGGCTGGCACCACCTGTGGTTATGGTGCCAAAAAAAACTTTATTTTACAAAGAAATGATGAGTGAAACAGTTATGCGAATATAAAATAGCCAGCGAGGGTTCACTTCATCCCCGACCTAAAGGTCAGGGTATTCGTGACCCTCTGCGCTCCCATAGTAATAAAGTTTAAAAGAAGAGCATACAAAGGCAAGCAAAAACAGGACGAGATGATTGTACGTCTTGTTGCAGTCTATAATGACGAGGATGAAAAGTACCATATTTATATCACAAATATTCAGAAAGATATTTTGAACGCACAAGACATTGCAAACCTATATGGAGCAAGATGGGACATAGAACTGTTGTTTAAGGAATTGAAAAGC

Contigs DB ...................................: ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db
HMM sources ..................................: Bacteria_71, Archaea_76, Ribosomal_RNA_16S
Alphabet/context target found ................: RNA:CONTIG
Alphabet/context target found ................: AA:GENE

HMM Profiling for Bacteria_71
Reference ....................................: Lee modified,

`anvi-script-reformat-fasta --simplify-names`: anvio script that reformats contig names for a consistent nomenclature

`anvi-gen-contigs-database`: anvio stores information about contigs in SQL databases; generating a database starts with the prediction of open reading frames

`-f / -n / -o`: Specify the input, give the database a name, define the output

`anvi-run-hmms -c -I -T 6`: the generated contigs databases (`-c`) are screened for the selected collections (`-I`) of single-copy marker genes and for 16S rRNA genes, using six threads (`-T`)

`anvi-display-contigs-stats`: Script that generates basic contig statistics (similar to `QUAST` in a way...) and that predicts the number of genomes present in the contigs database

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>Have a look at the generated stats .txt files, how many genomes can we expect?</li>
  <li>What resources did anvio use for its estimation of the number of genomes?</li>
</ul>
</div>

In [None]:
### Check out the stats
less ./02_ASSEMBLY_BINNING/stats_mock_metagenome_megahit.txt

contigs_db      mock_metagenome_megahit
Total Length    44253132
Num Contigs     1619
Num Contigs > 100 kb    118
Num Contigs > 50 kb     256
Num Contigs > 20 kb     462
Num Contigs > 10 kb     608
Num Contigs > 5 kb      738
Num Contigs > 2.5 kb    897
Longest Contig  684728
Shortest Contig 200
Num Genes (prodigal)    40094
L50     101
L75     250
L90     448
N50     112798
N75     51378
N90     21180
Archaea_76      459
Bacteria_71     603
Ribosomal_RNA_16S       6
bacteria (Bacteria_71)  10
archaea (Archaea_76)    3
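
If we only want the genome estimates from such a report, we can filter for its last lines. This is a minimal sketch, assuming the report is a two-column, tab-separated text file like the one shown above. The function name `expected_genomes` is made up for illustration, and the toy input repeats three lines from the report above.

```shell
# Hypothetical helper: pull the genome estimates out of a stats report
# (read from standard input) and add them up.
# Assumption: key and value are separated by a tab.
expected_genomes() {
  awk -F '\t' '
    # estimate lines look like "bacteria (Bacteria_71)<TAB>10"
    $1 ~ /^(bacteria|archaea) \(/ { print $1 ": " $2; total += $2 }
    END { print "total: " (total + 0) }'
}

# Toy input with values taken from the report above
printf 'Bacteria_71\t603\nbacteria (Bacteria_71)\t10\narchaea (Archaea_76)\t3\n' |
  expected_genomes
```

On the server the same function could be fed with the real report, e.g. `expected_genomes < ./02_ASSEMBLY_BINNING/stats_mock_metagenome_megahit.txt`.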


## Session 03 | Automated binning with METAWRAP and human-augmented binning with VIZBIN

### Introduction to binning

We assembled contigs, and based on the screening we carried out with `anvio`, our assembly contains bacterial as well as archaeal MAGs (metagenome-assembled genomes). A MAG is commonly defined as a genome that is reconstructed/recovered from a metagenome through assembly and binning. We did the assembly, but what about binning? Binning aims at identifying and grouping contigs that belong to the same population, i.e. a group of co-existing microbes whose genomes are sufficiently similar that sequences from this population map to the same reference genome.

<img src="img/binning_lego1.png" alt="Assembly done" width="500"/>

<font size="2"> We did the assembly, binning is next. © Rena Sophie Andräs </font>

What kind of information can we use to identify populations of contigs that belong together?

_Sequence composition-based:_
* GC-content
* k-mer frequencies

_Reference-based:_
* presence/absence of marker genes

_Intra-sample coherence:_
* differential coverage information

**IDEALLY WE WANT TO COMBINE AS MUCH INFORMATION AS POSSIBLE**
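
To get a feeling for the simplest of these signals, the GC-content, here is a minimal sketch that reports it per contig. It assumes a plain (uncompressed) FASTA on standard input; the function name `gc_content` and the two toy contigs are made up for illustration, and real binning tools combine such composition signals with coverage information.

```shell
# Hypothetical helper: print "<contig name><TAB><GC content in %>" for every
# record of a FASTA file read from standard input.
gc_content() {
  awk '
    function report() {
      if (name != "") printf "%s\t%.1f\n", name, (len ? 100 * gc / len : 0)
    }
    # header line: report the previous contig and start a new one
    /^>/ { report(); name = substr($1, 2); gc = 0; len = 0; next }
    # sequence line: count bases, then count G and C (gsub returns the number of hits)
    { seq = toupper($0); len += length(seq); gc += gsub(/[GC]/, "", seq) }
    END { report() }'
}

# Two toy contigs: one GC-rich, one AT-rich
printf '>contig_A\nGGCC\nGCGC\n>contig_B\nATAT\nATGC\n' | gc_content
```

Contigs that originate from the same genome tend to have a similar GC-content, which is why clearly different values (here 100.0 vs. 25.0) hint at different populations.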

In the metagenomics world people often compare binning with either confetti or LEGO. We are obviously biased, but we prefer LEGO 😁!

<img src="img/binning_lego2.png" alt="Assembly done" width="500"/>

<font size="2"> LEGO is a great analogy in the context of metagenomics. © Rena Sophie Andräs </font>

### Binning with METAWRAP

For the binning process we will use [`METAWRAP`](https://github.com/bxlab/metaWRAP), or to be precise its implemented binning module. `METAWRAP` is a tool that "wraps" many different pieces of software together in the context of genome-resolved metagenomics. The strength of `METAWRAP` regarding binning is that it combines three individual binning tools:

* [`CONCOCT`](https://github.com/BinPro/CONCOCT)
* [`MAXBIN2`](https://sourceforge.net/projects/maxbin2/)
* [`METABAT2`](https://bitbucket.org/berkeleylab/metabat/src/master/)

All three make use of differential coverage information and sequence composition based on k-mer frequencies.

In [8]:
### We activate a metawrap conda environment and get going
cd ~/data/MBGW223
conda deactivate
conda activate metawrap
metawrap binning -o ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap -t 6 -a ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta --metabat2 --maxbin2 --concoct ./00_READS/*.fastq

metawrap binning -o ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap -t 6 -a ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta --metabat2 --maxbin2 --concoct ./00_READS/dataset_A_1.fastq ./00_READS/dataset_A_2.fastq ./00_READS/dataset_B_1.fastq ./00_READS/dataset_B_2.fastq

------------------------------------------------------------------------------------------------------------------------
-----                                           Entered read type: paired                                          -----
------------------------------------------------------------------------------------------------------------------------


------------------------------------------------------------------------------------------------------------------------
-----                                  2 forward and 2 reverse read files detected                                 -----
---------------------------------------------------------------------------------------------------------------

Used parameters:

`-o`: Output folder, `METAWRAP` will generate one folder with bins for each of `CONCOCT`, `MAXBIN2`, and `METABAT2`

`-t`: We specify 6 threads to speed up the processing

`-a`: The assembled contigs that should be used for the binning

`--concoct`, `--maxbin2`, and `--metabat2`: Binning is done using these three tools

If you check the defined output folder, you see the three output folders and you can see that the three tools identified different numbers of bins:

In [9]:
### We count the number of bins in each output folder
pwd
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/concoct_bins/*.fa | wc -l
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/maxbin2_bins/*.fa | wc -l
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/metabat2_bins/*.fa | wc -l

/home/lu87neb/data/MBGW223
11
9
11
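
The three counting commands can also be wrapped into a small loop. This is a minimal sketch, assuming the bins are stored as `*.fa` files in folders whose names end in `_bins`, as in the output folder above. The function name `count_bins` is made up for illustration.

```shell
# Hypothetical helper: count the *.fa files in every "*_bins" folder
# below the given directory and print "<folder><TAB><count>".
count_bins() {
  for dir in "$1"/*_bins; do
    # skip anything that is not a directory (also covers a non-matching glob)
    [ -d "$dir" ] || continue
    n=$(find "$dir" -maxdepth 1 -name '*.fa' | wc -l | tr -d ' ')
    printf '%s\t%d\n' "$(basename "$dir")" "$n"
  done
}

count_bins ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap
```

Compared to three separate `ls | wc -l` calls, this prints the folder name next to each count, so it is harder to mix up which number belongs to which tool.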


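Next we use `METAWRAP`'s bin refinement module to consolidate the bins we obtained from `CONCOCT`, `MAXBIN2`, and `METABAT2`. With `-c 70` and `-x 10` we only keep bins with an estimated completeness of at least 70 % and an estimated contamination of at most 10 %.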

In [2]:
### Refining the bins
cd ~/data/MBGW223
conda activate metawrap 
metawrap bin_refinement -o ./02_ASSEMBLY_BINNING/bin_refinement_spades_metawrap -t 6 -A ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/metabat2_bins/ -B ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/maxbin2_bins/ -C ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/concoct_bins/ -c 70 -x 10

metawrap bin_refinement -o ./02_ASSEMBLY_BINNING/bin_refinement_spades_metawrap -t 6 -A ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/metabat2_bins/ -B ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/maxbin2_bins/ -C ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/concoct_bins/ -c 70 -x 10

------------------------------------------------------------------------------------------------------------------------
-----            There is 40 RAM and 6 threads available, and each pplacer thread uses >40GB, so I will            -----
-----                                          use 1 threads for pplacer                                           -----
------------------------------------------------------------------------------------------------------------------------


########################################################################################################################
#####                                                BEGIN PIPELINE!             


`METAWRAP` provides a highly automated way of binning. Several other tools allow "human-augmented binning", which usually makes use of differences in k-mer frequencies and differential coverage, and allows binning through visual inspection. Two examples are:

* [`anvio`](https://anvio.org/)
* [`VizBin`](https://github.com/claczny/VizBin)

We touched on `anvio` before when we were estimating the number of genomes in our assembly. We will use `VizBin` here because of its ease of use. You can download `VizBin` from the workshop folder `misc/VizBin.jar`. You also have to download the assembled contigs.

![_Vizbin_ interface](img/vizbin1.png)  
<font size="2"> The _VizBin_ interface. </font>

We load our contigs into `VizBin`, leave all settings at their defaults, and press start. A new window should open (after a few minutes, depending a bit on your laptop) and it should be full of blue dots (each blue dot represents one of our assembled contigs!).

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>Can you identify distinct populations of contigs? If yes, how many do you see?</li>
  <li>PLEASE NOTE: our dataset is fairly easy to bin. In real life, a complex metagenome dataset looks a lot messier!</li>
</ul>
</div>

<img src="img/vizbin2.png" alt="Binning with VizBin" width="500"/>

<font size="2"> _VizBin_ k-mer based ordination of a complex metagenome. </font>

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
For binning distinct populations of contigs, we now proceed as follows:
<ul>
  <li>By left-clicking around a population you create a selection window.</li>
  <li>After selecting a population, right-click in the selection and export it (`Selection --> Export`). Save the bins (e.g. bin1.fna) in a folder of your choice.</li>
  <li>Clear the selection (`right-click --> Selection --> Clear`) and repeat. Once you are done, upload the bins to your data folder on the server.</li>
</ul>
</div>

<sub> © Carl-Eric Wegner, 2023-08 </sub>