# ChemBioSys and AquaDiva | Genome-resolved metagenomics workshop

## Session 02 | Taxonomic profiling of metagenome datasets

As mentioned before, you will be working with two mock datasets that have been generated in silico. Starting from an unknown number of genomes, we simulated two sequencing runs. The original genomes are covered differently in these two runs, that is, they differ in abundance between the two datasets. This will become relevant further down the road, once we start doing assemblies and genome binning.

Before we assemble our sequence data, we want to get a first glimpse into our two datasets. Having this information available helps us get an idea of what to expect from the downstream metagenome assembly.

For the taxonomic profiling of short-read metagenome datasets, plenty of tools are available, a few are listed here:

|  **Tool**              | **Strategy**                                              |  **Reference** |
|------------------------|-----------------------------------------------------------|----------------|
|   Kraken/Kraken2       | k-mer profile matching                                    | [Wood et al., 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |
|   MetaPhlAn4           | unique SGB marker genes                                   | [Blanco-Miguez et al., 2023](https://www.nature.com/articles/s41587-023-01688-w) |
|   kaiju                | translated search, substring matching                     | [Menzel et al., 2016](https://www.nature.com/articles/ncomms11257) |
|   diamond/blast/megan  | local alignment searches, LCA-based taxonomic assignment  | [Bagci et al., 2021](https://currentprotocols.onlinelibrary.wiley.com/doi/full/10.1002/cpz1.59) |

The workflows for `Kraken` and `kaiju` are explained here as examples.

### Taxonomic profiling with Kraken

<img src="img/kraken.png" alt="Kraken" width="750"/>

<font size="2"> The k-mer matching based algorithm underlying Kraken, [_Wood and Salzberg, 2014_](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-3-r46) </font>

For any query sequence, `Kraken` computes all possible k-mers of a defined length. k-mers are substrings of length k that are contained in a sequence. Imagine the sequence `ATGCGTGA`, let's identify all 3-mers (k=3).

```
ATG
TGC
GCG
CGT
GTG
TGA
```
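
If you want to try this on other sequences, the enumeration can be sketched with a small sliding window. The helper name `kmers` is made up for this example and is not part of `Kraken`.

```shell
# List all k-mers of a sequence by sliding a window of width k
# along the sequence, one base at a time.
kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}

# A sequence of length L contains L - k + 1 k-mers (here 8 - 3 + 1 = 6).
kmers ATGCGTGA 3
```

Changing the second argument lets you see how quickly the number of k-mers per read drops as k approaches the read length.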

By default, `Kraken` uses a k-mer length of 31 (`Kraken2` uses 35). To classify a sequence, each k-mer in the sequence is mapped to the lowest common ancestor (LCA) of all genomes that contain that k-mer. The taxa associated with the sequence's k-mers, together with their ancestors, form a tree, which is used for classification.
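
To make the LCA idea a bit more tangible, here is a minimal sketch on a toy taxonomy. The taxonomy table and the helper `lca` are assumptions made for illustration. `Kraken` itself works on the full NCBI taxonomy and a precomputed k-mer database.

```shell
# Toy taxonomy as "child<TAB>parent" pairs (a hand-made excerpt, not real NCBI nodes).
taxonomy=$(printf '%s\t%s\n' \
    Bacteria root \
    Proteobacteria Bacteria \
    Escherichia Proteobacteria \
    Salmonella Proteobacteria \
    Firmicutes Bacteria \
    Bacillus Firmicutes)

# Print the lowest common ancestor of two taxa.
lca() {
    printf '%s\n' "$taxonomy" | awk -F'\t' -v a="$1" -v b="$2" '
        { parent[$1] = $2 }
        END {
            # Mark taxon a and all of its ancestors ...
            for (t = a; t != ""; t = parent[t]) seen[t] = 1
            # ... then climb up from taxon b until we reach a marked node.
            for (t = b; t != ""; t = parent[t])
                if (t in seen) { print t; exit }
        }'
}

# A k-mer found in both genera can only be assigned to their common ancestor.
lca Escherichia Salmonella
lca Escherichia Bacillus
```

The more distantly related the genomes sharing a k-mer are, the higher up in the tree that k-mer ends up, and the less specific its contribution to the classification.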

### Taxonomic profiling with kaiju

<img src="img/kaiju.png" alt="Kaiju" width="750"/>

<font size="2"> The substring matching approach of Kaiju, [_Menzel et al., 2016_](https://www.nature.com/articles/ncomms11257) </font>

In the case of `kaiju`, a query sequence is first translated into the six possible reading frames. The resulting amino acid sequences are split into fragments at stop codons. This population of fragments is then either sorted by length or BLOSUM62 score (this score is calculated based on the most likely substitution at each position). 

The sorted fragments are then searched against a reference database, looking for either exact matches or matches with the highest BLOSUM62 score. In case of the latter, exact matches of a minimum seed length (length = 7) are extended in both directions, allowing a certain number of mismatches. Taxonomic assignment is based on the taxonomic identity of the database sequence containing the best match. If more than one database sequence contains the match, the assignment is based on the LCA.
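
The fragmentation step can be sketched in a few lines. We assume that a reading frame has already been translated and that stop codons are written as `*`. The amino acid string and the helper `fragments_by_length` are made up for this example and do not reflect how `kaiju` is implemented internally.

```shell
# Split a translated reading frame at stop codons ("*") and
# list the resulting fragments from longest to shortest.
fragments_by_length() {
    tr '*' '\n' |
        awk 'length($0) > 0 { print length($0) "\t" $0 }' |
        sort -k1,1nr -k2,2
}

# Hypothetical translated frame, for illustration only.
printf '%s\n' 'MKT*LLVAGGSW*AR**QNPEL' | fragments_by_length
```

Longer fragments are more informative, which is why it makes sense to search them first.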

For the workshop we make use of `kaiju` because of its smaller memory footprint (and because of the cooler logo and subtle reference to Asian monster movies... 😁).

In [None]:
### Alright, we make sure we are in the workshop directory
cd ~/workshop
### We switch to the directory 00_READS, which contains our two datasets,
### and create a directory for the kaiju output we are about to generate
cd 00_READS
mkdir kaiju_out
### Now we call kaiju
kaiju-multi -z 25 -t nodes.dmp -f kaiju_db.fmi -i dataset_A_1.fastq,dataset_B_1.fastq -j dataset_A_2.fastq,dataset_B_2.fastq -o kaiju_out/dataset_A.out,kaiju_out/dataset_B.out

`kaiju-multi`: run kaiju on multiple datasets at the same time    
`-z`: with this option we can specify the number of threads to use    
`-t`: nodes.dmp, contains information about taxonomy nodes in the [`NCBI taxonomy`](https://www.ncbi.nlm.nih.gov/guide/taxonomy/), each taxonomy node has a unique identifier, nodes.dmp contains this identifier, as well as information about parent nodes, taxonomic rank, etc.    
`-f`: path to the kaiju reference database (.fmi file)    
`-i`, `-j`: paths to forward and reverse read files, files are given as comma-separated lists    
`-o`: specified paths for storing output, one per dataset    

Once `kaiju` is done, we want to summarize the taxonomic assignments for visual inspection.

In [None]:
### We convert the output from kaiju into so-called krona plots
for dataset in dataset_A dataset_B; do
    kaiju2krona -t nodes.dmp -n names.dmp -i kaiju_out/${dataset}.out -o kaiju_out/${dataset}.krona
    ktImportText -o kaiju_out/${dataset}.html kaiju_out/${dataset}.krona
done

`-t`: see above   
`-n`: names.dmp links the information from nodes.dmp (taxonomic identifiers) to unique taxonomy names, nodes.dmp and names.dmp are needed to summarize the taxonomic profiles that we obtained from kaiju         
`-i`: kaiju output used as input for the conversion   
`-o`: specified paths for storing output    

---
🔧**TASK**

Download the krona plots, open them in a browser and check out the taxonomic composition of the two datasets. Do they differ, do you see dominating taxa?

---

When doing taxonomic profiling (or any kind of profiling) it is important to know what proportion of your data got successfully assigned. The krona plots show you the number of assigned reads for the different taxonomic ranks, but how many sequences do we have in our datasets?

In [None]:
### Assuming we are still in "00_READS" we can check the number of reads as follows
expr $(cat dataset_A_1.fastq | wc -l) / 4
expr $(cat dataset_B_1.fastq | wc -l) / 4

`expr`: tool that allows you to evaluate expressions, here the number of lines in the .fastq file divided by 4    
`$(cat dataset_A_1.fastq| wc -l)`: the .fastq file is opened with `cat` and the number of lines counted with `wc -l`    
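
The proportion of assigned reads can also be read directly from the `kaiju` output, whose first column is `C` for classified and `U` for unclassified reads. Below is a minimal sketch on a toy table. The read names and taxon IDs are made up, and `percent_classified` is a hypothetical helper.

```shell
# Percentage of classified reads in kaiju output read from standard input.
# Column 1 is "C" (classified) or "U" (unclassified).
percent_classified() {
    awk -F'\t' '
        $1 == "C" { classified++ }
        END { if (NR > 0) printf "%.1f\n", 100 * classified / NR }'
}

# Toy kaiju output: three out of four reads got classified.
printf 'C\tread1\t562\nU\tread2\t0\nC\tread3\t1423\nC\tread4\t562\n' | percent_classified
```

For real data, redirect one of your `kaiju` output files into the function instead of the toy table.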

## Background | Metagenomics and its impact on public databases

Genome-resolved metagenomics has become mainstream in recent years, and it is no longer uncommon to see studies that deposit thousands of genomes recovered from metagenome datasets.

Take a look at the figure below, which shows the genome content in [`IMG/GOLD`](https://gold.jgi.doe.gov/index) as of 2016. Until then, databases were heavily dominated by genomes originating from isolates. This changed drastically due to genome-resolved metagenomics.

<img src="img/magsag.png" alt="Genome content in IMG/GOLD" width="750"/>

<font size="2"> Genomes in IMG/GOLD as of 2017. [_Bowers et al., 2017_](http://dx.doi.org/10.1038/nbt.3893) </font>

Meanwhile, [`NCBI Genomes`](https://www.ncbi.nlm.nih.gov/assembly/) contains around 1.7M bacterial genomes and a lot of them originate from metagenomic studies, but only a little over 40k genomes are considered complete.

## Session 02 | Estimating metagenome coverage

Coming back to our datasets, after we got a bit of an idea what taxonomic groups are in our datasets, we now want to check how well our datasets cover sequence diversity. What does that mean or why is that important? We ultimately want to recover near-complete genomes. For that we need enough sequencing data to assemble the respective genomes successfully. If we have highly and lowly abundant taxa present, they are represented by differently sized portions of our sequencing data. 

Imagine you are after the genome of a taxon that is known to feature rather big genomes, but the abundance of this taxon in your dataset is rather low. In such a case it is important to sequence "deep" enough (generate sufficient sequencing data) for downstream assembly.

We can estimate how well our datasets cover sequence diversity by means of k-mer-based rarefaction analysis. We learned about k-mers before. Rarefaction is used in ecology to assess richness. In microbial ecology, we use rarefaction for instance to assess species richness based on 16S rRNA gene sequencing data. We repeatedly subsample the sequences, starting from e.g. 0 sequences and incrementing the number of sampled sequences by, let's say, 10, and count the number of species we find. The relationship between the number of sampled sequences and the number of species found can then be plotted. Rarefaction curves usually rise sharply and saturate once richness is nearly covered.
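
The counting behind such a curve can be sketched as follows. This is a simplification: a real rarefaction analysis averages over many random subsamples, whereas the sketch walks through the reads once in the given order. The species labels and the helper `accumulation_curve` are made up for illustration.

```shell
# Number of distinct species seen after every `step` reads.
# Input: one species label per read, in the order the reads are sampled.
accumulation_curve() {
    awk -v step="$1" '
        !seen[$1]++ { richness++ }                # first time we see this species
        NR % step == 0 { print NR "\t" richness }'
}

# Toy assignments for 12 reads: species A dominates, D is rare.
printf '%s\n' A A B A C A B A A D A B | accumulation_curve 3
```

The first column is the number of sampled reads, the second the number of species found so far. Rare species such as `D` only show up once enough reads have been sampled.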

We can do the same with metagenomic datasets. We sample our reads and count the number of unique k-mers. For this we make use of [`nonpareil`](https://github.com/lmrodriguezr/nonpareil). Based on k-mer diversity in a dataset, `nonpareil` estimates how well the corresponding dataset covers sequence diversity. At the same time it gives you an idea how much data you need to reach good coverage. Below is some exemplary output.

<img src="img/nonpareil_cov_estimates.png" alt="Metagenome coverage estimation taken from Wegner et al., 2023" width="400"/>

<font size="2"> Metagenome coverage estimation taken from [_Wegner et al., 2023_](https://www.biorxiv.org/content/10.1101/2022.09.23.509128v1). </font>

OK, let's check our datasets.

In [None]:
### Assuming we are still in "00_READS"
### Create a directory for nonpareil output
mkdir nonpareil_out
### We concatenate forward and reverse reads so that we can run nonpareil
### over both of them
cat dataset_A_[12].fastq > dataset_A_concatenated.fastq
nonpareil -s dataset_A_concatenated.fastq -T kmer -f fastq -b nonpareil_out/dataset_A -t 6
### We take a look at the nonpareil output
awk -F"\t" '{print $1 "\t" $2 "\t" $3}' nonpareil_out/dataset_A.npo

`cat`: command that prints and concatenates files    
`[12]`: we make use of wildcards for the concatenation    
`-s`: input file for `Nonpareil`    
`-T`: method for coverage estimation, we use k-mers (length = 21)    
`-f`: format of the input file    
`-b`: specified output path    
`-t`: number of threads to use in parallel, here 6    
`awk`: commandline programming language for text file manipulation    
`-F`: field separator of input file, here `TAB`   
`print`: print selected columns, here columns 1, 2, and 3

The first column indicates the sequencing effort (= amount of data sampled), the second the average coverage, and the third one the standard deviation of the coverage. We see that in the case of dataset A, sequence diversity is covered to almost 100% and that 90% coverage would have been reached with less sequencing effort.
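
Instead of scanning the table by eye, we can also ask for the first row that reaches a given coverage. This is a minimal sketch on a toy table with the three columns described above. The values are made up, and `effort_for_coverage` is a hypothetical helper that is not part of `nonpareil`.

```shell
# Report the smallest sequencing effort whose average coverage reaches a target.
# Input: "effort<TAB>average coverage<TAB>standard deviation", lines starting with "#" are skipped.
effort_for_coverage() {
    awk -F'\t' -v target="$1" '
        /^#/ { next }
        $2 + 0 >= target + 0 { print $1; found = 1; exit }
        END { if (!found) print "not reached" }'
}

# Toy table with made-up values.
printf '# toy values\n0.1\t0.35\t0.02\n0.5\t0.78\t0.01\n1.0\t0.92\t0.01\n2.0\t0.99\t0.00\n' |
    effort_for_coverage 0.9
```

If the target is never reached in the table, more sequencing would be needed to get there.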

---
🔧**TASK**

Check dataset B in the same way.

---

## Background | A look into the theory behind (meta)genome assembly

Alright, after looking at sequence data formats, checking out exemplary QC reports, taxonomic profiling, and metagenome coverage estimation, we now move on to talking about metagenome assembly.

Metagenome datasets usually originate from the sequencing of environmental genomic DNA, which means that the sequenced DNA fragments originate from many different, taxonomically diverse microbes. For metagenome assembly this sounds like a problem, and yes, it is one. High-throughput sequencing platforms generate between hundreds of thousands and billions of sequence reads. How do we stitch them together?

This question was, in a way, answered quite a while ago by the mathematician Leonhard Euler, who founded [graph theory](https://en.wikipedia.org/wiki/Graph_theory) while working on a problem set in the city of Königsberg.

The Königsberg problem refers to the seven bridges of Königsberg. In the 18<sup>th</sup> century, the residents of Königsberg wondered whether every part of the city could be visited by crossing each bridge exactly once and returning to one's starting point. Euler's idea was as simple as it was brilliant. He imagined every landmass of the city as a node and every bridge as an edge, resulting in a network of nodes connected by edges. The original question was thus simplified to finding a path in the network that uses every edge exactly once.

<img src="img/konigsberg.png" alt="The bridges of Königsberg problem" width="750"/>

<font size="2"> The bridges of Königsberg problem. [_Compeau et al., 2011_](https://www.nature.com/articles/nbt.2023) </font>

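This idea was taken further by Nicolaas Govert de Bruijn, after whom de Bruijn graphs are named. He was interested in cyclic sequences of letters taken from a given alphabet in which every possible word of a certain length (k) appears exactly once as a string of consecutive characters (de Bruijn sequences). In the corresponding de Bruijn graph, every word of length k is an edge that connects two overlapping words of length k-1, and a walk that uses every edge exactly once spells out such a sequence.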

Transferring this idea to metagenome assembly means the following: 

1. we split our sequenced metagenome reads into k-mers of length k,
2. we use the (k-1)-mers contained in these k-mers (their prefixes and suffixes) as nodes,
3. and connect the nodes using the k-mers as edges.

Now we walk the graph and use every k-mer only once. A simple example is shown below.

From a computational perspective, this is still demanding, but it is a solvable problem. Most currently used (meta)genome assemblers are based on de Bruijn graphs, and while first assemblers had high computational requirements, primarily RAM for storing intermediate de Bruijn graphs, recent assemblers are able to run on almost standard, everyday hardware.

<img src="img/seq.png" alt="K-mers and (K-1)-mers" width="500"/>
<img src="img/debruijn.png" alt="De Bruijn graphs" width="750"/>
<font size="2"> Example, de Bruijn graph-based assembly. </font>
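
The three steps listed above can be sketched in a few lines. The two toy reads and the helper `debruijn_edges` are made up for illustration. Real assemblers additionally deal with sequencing errors, reverse complements, and repeats, none of which is covered here.

```shell
# Build the edges of a de Bruijn graph from reads: every distinct k-mer becomes
# an edge from its (k-1)-mer prefix to its (k-1)-mer suffix.
debruijn_edges() {
    awk -v k="$1" '
        {
            for (i = 1; i + k - 1 <= length($0); i++) {
                kmer = substr($0, i, k)
                if (!(kmer in seen)) {        # use every distinct k-mer only once
                    seen[kmer] = 1
                    print substr(kmer, 1, k - 1) " -> " substr(kmer, 2) "  [" kmer "]"
                }
            }
        }'
}

# Two overlapping toy reads that both stem from the sequence ATGCGTGA.
printf '%s\n' ATGCGT GCGTGA | debruijn_edges 3
```

Walking the six edges in the order `AT -> TG -> GC -> CG -> GT -> TG -> GA` uses every k-mer exactly once and spells out `ATGCGTGA`. Note that the node `TG` is visited twice, which hints at why repeats make assembly hard.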

---
❓**QUESTION**

This sounds so great and simple, why do we then struggle often with getting good, complete, and closed assemblies? How can we solve these problems?

---

## Session 02 | Metagenome co-assembly with MEGAHIT and METASPADES

Let's assemble our datasets. We will use two different assemblers, [`MEGAHIT`](https://github.com/voutcn/megahit) and [`SPADES`](https://cab.spbu.ru/software/spades/). `MEGAHIT` is a very fast, memory-efficient short-read assembler, while `SPADES` is commonly considered the "best" short-read assembler available.

In [None]:
### We switch to our workshop folder and activate the needed conda environment
cd ~/workshop
conda activate assemblers
### We first use MEGAHIT to co-assemble our two datasets ...
megahit -1 ./00_READS/dataset_A_1.fastq.gz,./00_READS/dataset_B_1.fastq.gz  -2 ./00_READS/dataset_A_2.fastq.gz,./00_READS/dataset_B_2.fastq.gz -m 0.05 -t 8 -o ./02_ASSEMBLY_BINNING/assembly_megahit
### ... and then SPADES in metagenome mode
cat ./00_READS/dataset_[AB]_1.fastq.gz > ./00_READS/merged_1.fastq.gz
cat ./00_READS/dataset_[AB]_2.fastq.gz > ./00_READS/merged_2.fastq.gz
spades.py -t 8 --only-assembler -o ./02_ASSEMBLY_BINNING/assembly_spades --meta -1 ./00_READS/merged_1.fastq.gz -2 ./00_READS/merged_2.fastq.gz 


`-m`: maximum memory `MEGAHIT` may use, here given as a fraction (5%) of the total memory of the machine    
`--presets`: not used here, but `MEGAHIT` comes with different k-mer presets depending on the complexity of the metagenome    
`-t`: We specify eight threads for the assembly, time is money, but we also do not want to block all our server capacity    
`-1 / -2`: Forward and reverse reads to assemble    
`--meta`: Similar to _MEGAHIT_, _SPADES_ comes with presets, --meta (surprise) is the preset for metagenomes    
`-o`: output folders    
`--only-assembler`: `SPADES` usually starts with an error correction of the provided reads before the actual assembly, with this option we skip that step and run only the assembler    

Which assembler did a better job, how can we assess this, what are parameters/variables to look at? We use [`QUAST`](https://github.com/ablab/quast) to check out our assemblies.

---
❓/🔧**QUESTION/TASK**

* Use QUAST to get some basic statistics about the assembled sets of contigs.
* What parameters are calculated (check N50, L50), how can they help you judge the performance of the assemblers?

---

In [None]:
### We switch the conda environment and use quast.py to check the assemblies
conda deactivate
conda activate metawrap
### We check first the MEGAHIT assembly and then the SPADES assembly
quast.py -o ./02_ASSEMBLY_BINNING/assembly_megahit/quast ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
quast.py -o ./02_ASSEMBLY_BINNING/assembly_spades/quast ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta

Used parameters:

`-o`: we define an output folder, where `QUAST` should store its reports.
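
Two of the statistics reported by `QUAST`, N50 and L50, are easy to compute by hand, which helps to understand what they mean. The sketch below uses toy contig lengths. The helpers `contig_lengths` and `n50_l50` are made up for this example and are not part of `QUAST`.

```shell
# Print the length of every sequence in a FASTA stream.
contig_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }'
}

# N50: length of the contig at which the cumulative length, going from the longest
#      to the shortest contig, first reaches half of the total assembly size.
# L50: number of contigs needed to get there.
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                cumulative += len[i]
                if (cumulative >= total / 2) { print "N50=" len[i] "\tL50=" i; exit }
            }
        }'
}

# Hypothetical contig lengths in bp (total 1000 bp, so we need to reach 500 bp).
printf '%s\n' 100 400 250 50 200 | n50_l50
```

For a real assembly, pipe the FASTA file through both helpers, for example `contig_lengths < contigs.fasta | n50_l50`. A higher N50 together with a lower L50 indicates a more contiguous assembly, but neither says anything about correctness.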

### Estimating the number of genomes in our assemblies

Once the assembly is done, one important question is: "How many genomes are we (potentially) dealing with in our assembly?" To answer this question we make use of [`anvio`](http://merenlab.org/software/anvio/). _anvio_ is "an open-source, community-driven analysis and visualization platform for microbial ‘omics. It brings together many aspects of today’s cutting-edge strategies including genomics, metagenomics, metatranscriptomics, pangenomics, metapangenomics, phylogenomics, and microbial population genetics in an integrated and easy-to-use fashion through extensive interactive visualization capabilities". `anvio` is a lighthouse example of a community effort to make cutting-edge methods available to a broad group of (non-nerdy) users. 💗💗💗

---
❓/🔧**QUESTION/TASK**

How many genomes can we expect from the assembly?

---

In [None]:
### We use anvio to estimate the number of genomes in our two assemblies
### anvio expects that our contig names are formatted in a consistent manner
### How do our two .fasta files with assembled contigs look?!
conda deactivate
conda activate anvio7
head ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
head ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta
### In both cases, the header includes information about the contig assembly,
### like contig length etc.
### Let's reformat them
anvi-script-reformat-fasta --simplify-names -o ./02_ASSEMBLY_BINNING/assembly_megahit/contigs_renamed.fa ./02_ASSEMBLY_BINNING/assembly_megahit/final.contigs.fa
anvi-script-reformat-fasta --simplify-names -o ./02_ASSEMBLY_BINNING/assembly_spades/contigs_renamed.fa ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta
### We generate so called contig databases (explained below!), ...
anvi-gen-contigs-database -f ./02_ASSEMBLY_BINNING/assembly_megahit/contigs_renamed.fa -n mock_metagenome_megahit -o ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db -T 6
anvi-gen-contigs-database -f ./02_ASSEMBLY_BINNING/assembly_spades/contigs_renamed.fa -n mock_metagenome_spades -o ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db -T 6
### ... identify certain genes, add this information (more about that in a bit) ...,
anvi-run-hmms -c ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db -I Bacteria_71,Archaea_76,Ribosomal_RNA_16S -T 6
anvi-run-hmms -c ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db -I Bacteria_71,Archaea_76,Ribosomal_RNA_16S -T 6
### ... and summarize them!
anvi-display-contigs-stats --report-as-text -o ./02_ASSEMBLY_BINNING/stats_mock_metagenome_megahit.txt ./02_ASSEMBLY_BINNING/mock_metagenome_megahit-CONTIGS.db
anvi-display-contigs-stats --report-as-text -o ./02_ASSEMBLY_BINNING/stats_mock_metagenome_spades.txt ./02_ASSEMBLY_BINNING/mock_metagenome_spades-CONTIGS.db

`anvi-script-reformat-fasta --simplify-names`: anvio script to reformat contig names for a consistent nomenclature      
`anvi-gen-contigs-database`: anvio works with SQL databases to store information about contigs (length, location of open reading frames, annotation data, ...), databases are generated by first predicting open reading frames     
`anvi-run-hmms -c -I -T 6`: generated contig databases are screened for collections of single-copy-marker genes and 16S rRNA genes   
`-f` / `-n` / `-o`: specify input, give the database a name, define output  
`anvi-display-contigs-stats`: script that generates basic contig statistics (similar to `QUAST` in a way...) and that predicts the number of genomes present in the contigs database  

---
❓**QUESTION**

* Have a look at the generated stats .txt files, how many genomes can we expect?
* What resources did anvio use for its estimation of the number of genomes?

---

In [None]:
### Check out the stats
less ./02_ASSEMBLY_BINNING/stats_mock_metagenome_megahit.txt
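
As far as we understand it, the estimate is based on how often each single-copy core gene (SCG) is found in the assembly: a gene that occurs once per genome should be hit roughly once per genome, so the most frequent hit count (the mode) is a reasonable guess for the number of genomes. Below is a minimal sketch with made-up hit counts. The helper `mode_of_hits` is hypothetical and not an `anvio` program.

```shell
# Estimate the number of genomes as the mode of SCG hit counts.
# Input: "gene<TAB>number of hits in the assembly" per line.
# Ties are resolved in favour of the smaller count.
mode_of_hits() {
    awk -F'\t' '
        { freq[$2]++ }
        END {
            for (hits in freq)
                if (freq[hits] > best || (freq[hits] == best && hits + 0 < mode + 0)) {
                    best = freq[hits]; mode = hits
                }
            print mode
        }'
}

# Toy table: most marker genes were found five times.
printf '%s\t%s\n' Ribosomal_L2 5 Ribosomal_S3 5 RecA 4 GyrB 5 RpoB 6 | mode_of_hits
```

Genes with fewer hits than the mode hint at incompletely assembled genomes, while genes with more hits can result from fragmented genes or genuine extra copies.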

## Session 02 | Automated binning with METAWRAP

### Background | Introduction to binning

We assembled contigs, and based on the screening carried out with `anvio`, our assembly contains bacterial as well as archaeal MAGs (metagenome-assembled genomes). A MAG is commonly defined as a genome that is reconstructed/recovered from a metagenome through assembly and binning. We did the assembly, what about binning? Binning aims at identifying and grouping contigs that belong to one population, that is, a group of co-existing microbes whose genomes are sufficiently similar that genome assemblages from this population map to the same reference genome.

<img src="img/binning_lego1.png" alt="Assembly done" width="500"/>

<font size="2"> We did the assembly, binning is next. © Rena Sophie Andräs </font>

What kind of information can we use to identify populations of contigs that belong together?

_Sequence composition-based:_
* GC-content
* k-mer frequencies

_Reference-based:_
* presence/absence of marker genes

_Intra-sample coherence:_
* differential coverage information

**»» IDEALLY WE WANT TO COMBINE AS MUCH INFORMATION AS POSSIBLE ««**
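
To get a feeling for the simplest of these signals, the GC content per contig can be computed as sketched below. The two toy contigs and the helper `gc_content` are made up for illustration. Binning tools rely on richer composition profiles, such as k-mer frequencies, in combination with coverage.

```shell
# GC content (in %) for every contig in a FASTA stream.
gc_content() {
    awk '
        function report() {
            if (name != "" && len > 0) printf "%s\t%.1f\n", name, 100 * gc / len
        }
        /^>/ { report(); name = substr($1, 2); gc = 0; len = 0; next }
        {
            seq = toupper($0)
            len += length(seq)
            gc += gsub(/[GC]/, "", seq)      # gsub returns the number of replaced characters
        }
        END { report() }'
}

# Two toy contigs with clearly different composition.
printf '>contig_1\nGGCC\nGCGC\n>contig_2\nATAT\nGCAT\n' | gc_content
```

Contigs from the same genome tend to have a similar composition, so contigs with very different GC content are unlikely to end up in the same bin.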

In the metagenomics world people compare binning often with either confetti or LEGO, we are obviously biased and prefer LEGO 😁!

<img src="img/binning_lego2.png" alt="Assembly done" width="500"/>

<font size="2"> LEGO is a great analogy in the context of metagenomics. © Rena Sophie Andräs </font>

There are many different LEGO themes/lines (think about LEGO City, Classic, Technic, ...). You can imagine genomes as finished LEGO sets, and assemblies as a mess of disassembled sets. Every brick in this mess represents one contig. Now we sort them and put them back together, we bin them, by color, form, theme, etc.

### Binning with METAWRAP

For the binning process we will use [`METAWRAP`](https://github.com/bxlab/metaWRAP), or to be precise its implemented binning module. `METAWRAP` is a tool that "wraps" many different pieces of software together in the context of genome-resolved metagenomics. The strength of `METAWRAP` regarding binning is that it combines three individual binning tools:

* [`CONCOCT`](https://github.com/BinPro/CONCOCT)
* [`MAXBIN2`](https://sourceforge.net/projects/maxbin2/)
* [`METABAT2`](https://bitbucket.org/berkeleylab/metabat/src/master/)

All three make use of differential coverage information and sequence composition based on k-mer frequencies.

In [None]:
### We activate the metawrap conda environment and get going
conda deactivate
conda activate metawrap
metawrap binning -o ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap -t 6 -a ./02_ASSEMBLY_BINNING/assembly_spades/contigs.fasta --metabat2 --maxbin2 --concoct ./00_READS/dataset_[AB]_[12].fastq

`-o`: output folder, `METAWRAP` will generate one folder with bins each, for `CONCOCT`, `MAXBIN2`, and `METABAT2`    
`-t`: we specify 6 threads to speed up the processing    
`-a`: the assembled contigs that should be used for the binning    
`--concoct`, `--maxbin2`, and `--metabat2`: Binning is done using these three tools    

If you check the defined output folder, you see the three output folders and you can see that the three tools identified different numbers of bins:

In [None]:
### We count the number of bins in each output folder
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/concoct_bins/*.fa | wc -l
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/maxbin2_bins/*.fa | wc -l
ls ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/metabat2_bins/*.fa | wc -l

`ls`: we list only .fa files (using the `*` wildcard)...    
`wc -l`: ... and count the number of files in the individual output folders    

Next we use `METAWRAP`'s bin refinement module to consolidate the bins we obtained from `CONCOCT`, `MAXBIN2`, and `METABAT2`. `METAWRAP` first creates hybrid sets of bins from the existing sets. In our case we have bins from `METABAT2` (set A), `MAXBIN2` (set B), and `CONCOCT` (set C), yielding four hybrid sets: `AB`, `AC`, `BC`, and `ABC`. Completion and contamination (covered below) are determined for each bin, and the seven sets of bins (three original, four hybrid) are then compared to each other. If a bin is identified in two sets (based on >80% genome overlap), the better version (based on completion and contamination) is kept. Contigs that are present in more than one bin are only kept in the highest-quality bin and removed from the others. The final set of refined bins is again checked for completion and contamination.

In [None]:
### Refining the bins
metawrap bin_refinement -o ./02_ASSEMBLY_BINNING/bin_refinement_spades_metawrap -t 6 -A ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/metabat2_bins/ -B ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/maxbin2_bins/ -C ./02_ASSEMBLY_BINNING/initial_binning_spades_metawrap/concoct_bins/ -c 70 -x 10

`-o`: specified output folder for the final set of refined bins    
`-t`: we use multiple threads    
`-A`, `-B`, `-C`: the sets of initial bins to use for the refinement    
`-c` & `-x`: cutoffs in percent for minimum genome completion (`-c`) and maximum contamination (`-x`)    
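
What the two cutoffs do can be sketched on a small table. The three-column layout, the values, and the helper `filter_bins` are assumptions made for this example. The real stats file written by `METAWRAP` contains more columns.

```shell
# Keep bins with completion >= minimum and contamination <= maximum (both in percent).
# Input: "bin<TAB>completion<TAB>contamination", the first line is a header.
filter_bins() {
    awk -F'\t' -v c="$1" -v x="$2" '
        NR == 1 { next }
        $2 + 0 >= c + 0 && $3 + 0 <= x + 0 { print $1 }'
}

# Toy table with made-up values, filtered with the cutoffs used above.
printf 'bin\tcompletion\tcontamination\nbin.1\t98.5\t1.2\nbin.2\t65.0\t0.5\nbin.3\t88.0\t14.3\nbin.4\t72.4\t9.8\n' |
    filter_bins 70 10
```

Here `bin.2` is dropped because it is too incomplete and `bin.3` because it is too contaminated. Stricter cutoffs yield fewer but cleaner bins.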

---
❓/🔧**QUESTION/TASK**

How good are your bins? Check the summary from the bin refinement module: `./02_ASSEMBLY_BINNING/bin_refinement_spades_metawrap/metawrap_70_10_bins.stats`

---


### Human-augmented binning

`METAWRAP` provides a highly automated way of binning. Several other tools allow "human-augmented binning", which usually makes use of differences in k-mer frequencies and differential coverage, and allows binning and refinement through visual inspection. Three examples are:

* [`anvio`](https://anvio.org/)
* [`VizBin`](https://github.com/claczny/VizBin)
* [`BusyBeeWeb`](https://ccb-microbe.cs.uni-saarland.de/busybee/) »» based on VizBin, comes with a web interface

OK, let's give `BusyBeeWeb` a try. 

1. We download our contigs and upload them to the `BusyBeeWeb`-Server,
2. we request taxonomic annotation of our contigs,
3. and run BusyBeeWeb.

This should only take a few minutes.

<img src="img/busybee_1.png" alt="The BusyBeeWeb landing page." width="500"/>

<font size="2"> The BusyBeeWeb landing page. </font>

<img src="img/busybee_2.png" alt="The upload mask." width="500"/>

<font size="2"> The upload mask. </font>

<img src="img/busybee_3.png" alt="Exemplary results." width="500"/>

<font size="2"> Exemplary results. </font>

---
❓**QUESTION**

How many bins did `BusyBeeWeb` find in comparison to `METAWRAP`?

---
#### About anvio

`anvio` is extremely well documented and hardly matched by any other open-source bioinformatics software project in terms of openness and service to the scientific community. If you are interested in genome-resolved metagenomics and want to learn more, check out some `anvio`-related resources:

* [`anvio homepage`](https://anvio.org/) »» plenty of resources on meta'omics
* [`Tara Oceans Genome-resolved metagenomics`](https://merenlab.org/data/tara-oceans-mags/) »» story behind [this](https://doi.org/10.1038/s41564-018-0176-9) paper, complete walkthrough, great start if you want to familiarize yourself with `anvio`
* [`Blog post about manual refinement`](https://merenlab.org/2017/05/11/anvi-refine-by-veronika/) »» refinement matters!

## Session 02 | Taxonomic assignment of our refined bins

Now that we are done with binning our genomes, we want to know _who are we dealing with?!_. We already got some idea about the taxonomic identity of our bins from `METAWRAP`, which uses [`checkM`](https://github.com/Ecogenomics/CheckM) under the hood to assess the completeness and contamination of bins based on single-copy core genes (SCGs). SCGs are "genes that commonly occur across many taxa but only with one copy". Contamination can for instance arise if we binned contigs that originated from different organisms. This is especially an issue when you are dealing with somewhat closely related organisms.

---
❓**QUESTION**

How can we use SCGs to assess/and determine:

* genome completeness,
* genome contamination,
* and taxonomic affiliation?!

---
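
For the first two points, a strongly simplified calculation looks as follows: completion is the share of expected SCGs that were found at least once, and contamination is derived from SCGs found more often than once. The toy counts and the helper `scg_quality` are made up. `checkM` uses lineage-specific marker sets and a more refined scoring, so take this as a sketch of the idea only.

```shell
# Simplified SCG-based quality estimate for one bin.
# Input: "gene<TAB>copies found in the bin", one line per expected SCG.
#   completion    = % of expected SCGs found at least once
#   contamination = surplus copies relative to the number of expected SCGs, in %
scg_quality() {
    awk -F'\t' '
        { expected++; if ($2 > 0) found++; if ($2 > 1) extra += $2 - 1 }
        END {
            if (expected > 0)
                printf "completion=%.1f\tcontamination=%.1f\n",
                       100 * found / expected, 100 * extra / expected
        }'
}

# Toy bin: 10 expected SCGs, two are missing and one was found twice.
printf 'scg%s\t%s\n' 1 1 2 1 3 1 4 0 5 1 6 2 7 1 8 1 9 0 10 1 | scg_quality
```

A second copy of a gene that should occur only once suggests that contigs from more than one organism ended up in the same bin.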

### Taxonomic assignment with GTDB-Tk

For the taxonomic assignment, we use [`GTDB-Tk`](https://github.com/Ecogenomics/GTDBTk), which relies on the Genome Taxonomy Database ([`GTDB`](https://gtdb.ecogenomic.org/)) that is maintained by the Australian Centre for Ecogenomics.

`GTDB-Tk` classifies genomes by combining three approaches:

1. Placing them in a reference tree based on a defined set of SCGs    
» taxon-specific SCG sets are selected based on a pre-screening    
2. Calculating ANIs (average nucleotide identity) based on comparisons against reference genomes
3. Determining relative evolutionary divergence

GTDB-Tk uses big reference databases, meaning the run takes a while. Once it is done, check the output folder and check summary files.


In [None]:
### TO DO

Check out the `GTDB-Tk` summary, available here `.../...`. What genomes are we dealing with?!

---
🔓**SUMMARY**

* 🎓 we talked about how to taxonomically profile short-read metagenomic data, 
* 📈 estimated coverage in our metagenome datasets,
* 🔍 assembled them,
* 🎯 binned the assembled contigs,
* 🌲 and checked their taxonomic affiliation!
  
--- 

<sub> © Carl-Eric Wegner, 2023-08 </sub>