# "Molecular Microbial Ecology" MBGW2.2.3

## Assessing bin completeness and contamination and gaining insights into the taxonomy of your bins

Now that we are done with binning our genomes, we want to know how “good” our genome bins are, with “good” meaning how complete they are and how little contamination they contain. `METAWRAP` already provided us with some information in that regard; we now want to take a closer look.

Contamination can arise, for instance, if a bin contains contigs that originated from different organisms. This is especially an issue when you are dealing with closely related organisms. How can we assess genome contamination and completeness? The answer lies in distinct sets of single-copy marker genes that are characteristic of the taxonomic groups of interest.

Depending on how many of these marker genes we find in a putative genome bin, we can draw conclusions about its completeness. At the same time, their copy number tells us whether the genome is contaminated and to what extent. Each marker gene should occur exactly once; if marker genes are present multiple times, the bin presumably contains foreign sequence information.
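The counting logic described above can be sketched in a few lines of shell. This is a deliberately simplified illustration and not how `checkM` computes its values (checkM works with lineage-specific, collocated marker sets). The function name `marker_stats` and the toy marker table are assumptions made up for this example.

```shell
# marker_stats: reads a two-column table (marker name <TAB> copies found in the bin)
# and prints "<completeness %> <contamination %>".
# Simplification: every marker is weighted equally.
marker_stats() {
  awk -F'\t' '
    {
      total++                         # every line is one expected marker gene
      if ($2 >= 1) found++            # marker present at least once
      if ($2 > 1)  extra += $2 - 1    # every additional copy counts as contamination
    }
    END {
      if (total == 0) { print "no markers given" > "/dev/stderr"; exit 1 }
      printf "%.1f %.1f\n", 100 * found / total, 100 * extra / total
    }' "$@"
}

# Toy input (hypothetical markers): three of four markers found, one of them twice.
printf 'markerA\t1\nmarkerB\t1\nmarkerC\t0\nmarkerD\t2\n' | marker_stats
# prints: 75.0 25.0
```

A missing marker lowers completeness, while a duplicated marker raises contamination, so both values can be read from the same table.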

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
What are single-copy core genes? Can you think of obvious weaknesses of this approach?
</div>

To assess completeness, contamination, and taxonomic affiliation, we use [`checkM`](https://github.com/Ecogenomics/CheckM) and [`GTDB-Tk`](https://github.com/ecogenomics/gtdbtk).

`GTDB-Tk` relies on the Genome Taxonomy Database [`GTDB`](https://gtdb.ecogenomic.org/), which is maintained by the Australian Centre for Ecogenomics.


In [None]:
### We move to our practical course directory and first activate the needed checkM conda environment
cd ~/data/MBGW223
conda activate checkm
### -x fa: our bins end in .fa, whereas checkM looks for .fna files by default
checkm lineage_wf ./02_ASSEMBLY_BINNING/bin_refinement_megahit_metawrap/metawrap_70_10_bins ./02_ASSEMBLY_BINNING/checkm_metawrap_refined -t 6 --reduced_tree -x fa
### --tab_table: write the summary as a tab-separated table
checkm qa ./02_ASSEMBLY_BINNING/checkm_metawrap_refined/lineage.ms ./02_ASSEMBLY_BINNING/checkm_metawrap_refined -o 2 -f checkm_summary.tsv --tab_table
### Next, we want to classify (taxonomically place) our bins with GTDB-Tk
conda deactivate
conda activate gtdbtk_1.5
### --extension fa: GTDB-Tk also expects .fna files by default
gtdbtk classify_wf --genome_dir ./02_ASSEMBLY_BINNING/bin_refinement_megahit_metawrap/metawrap_70_10_bins --out_dir ./02_ASSEMBLY_BINNING/gtdbtk_metawrap_refined --extension fa

Used parameters:

`lineage_wf`: checkM comes with different workflows; `lineage_wf` assesses contamination and completeness based on lineage-specific sets of single-copy core genes. checkM first identifies open reading frames and checks whether they code for single-copy core genes. If so, these are extracted and matched against databases to infer the taxonomic identity of the respective genome. Based on the inferred taxonomy, a matching set of single-copy core genes is then used to assess completeness and contamination.

`--reduced_tree`: checkM eats memory like cookies. To reduce the computational burden, we use a reduced reference tree when assigning taxonomy.

`qa`: This is basically the last step of the previous workflow, which we repeat to generate text output that we can save. The workflow itself only prints its results to the command line.

`--genome_dir`: Path of the folder containing your (refined) MAGs.

`--out_dir`: Path to the location where gtdbtk will save its output.
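Once the `checkM` summary exists, bins can be screened against quality thresholds. The following is a minimal sketch, assuming the table was written tab-separated (checkM offers a `--tab_table` option for this) and contains columns named `Completeness` and `Contamination`. The function name `filter_bins` and the toy table are made up for illustration; the default thresholds of 70 and 10 mirror the `metawrap_70_10_bins` folder name.

```shell
# filter_bins FILE [MIN_COMPLETENESS] [MAX_CONTAMINATION]
# Prints bin id, completeness and contamination of all bins passing both thresholds.
filter_bins() {
  awk -F'\t' -v min_comp="${2:-70}" -v max_cont="${3:-10}" '
    NR == 1 {
      # locate the columns by header name instead of relying on fixed positions
      for (i = 1; i <= NF; i++) {
        if ($i == "Completeness")  comp = i
        if ($i == "Contamination") cont = i
      }
      if (!comp || !cont) { print "expected columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    $comp + 0 >= min_comp + 0 && $cont + 0 <= max_cont + 0 { print $1 "\t" $comp "\t" $cont }
  ' "$1"
}

# Toy table with invented values, only to demonstrate the call.
toy=$(mktemp)
printf 'Bin Id\tMarker lineage\tCompleteness\tContamination\n' >  "$toy"
printf 'bin.1\tk__Bacteria\t95.20\t1.10\n'                      >> "$toy"
printf 'bin.2\tk__Bacteria\t62.00\t0.50\n'                      >> "$toy"
printf 'bin.3\tk__Bacteria\t88.40\t14.70\n'                     >> "$toy"
filter_bins "$toy" 70 10    # only bin.1 passes both thresholds
rm -f "$toy"
```

Looking up the columns by name keeps the sketch independent of the exact column order, which differs between checkM output formats.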

`GTDB-Tk` classifies genomes by combining three approaches:

1. Placing them in a reference tree based on a defined set of single-copy marker genes.
2. Calculating average nucleotide identities (ANIs) against a collection of reference genomes.
3. Determining relative evolutionary divergence (RED).

`GTDB-Tk` uses big reference databases, meaning the run takes a while. Once it is done, check the output folder and download the summary files (also from `checkM`).
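To get a quick overview of the taxonomy without opening the summary in a spreadsheet, a single rank can be pulled out of the classification strings. This is a sketch under the assumption that the GTDB-Tk summary is a tab-separated table with the columns `user_genome` and `classification`, the latter holding strings such as `d__...;p__...;...;g__...;s__...`. The function name `gtdb_rank` and the toy records are invented for this example.

```shell
# gtdb_rank FILE RANK   (RANK is the one-letter prefix, e.g. p, f or g)
# Prints "<bin> <TAB> <name at that rank>" or "unclassified" if the rank is empty.
gtdb_rank() {
  awk -F'\t' -v rank="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "user_genome")    id  = i
        if ($i == "classification") tax = i
      }
      if (!id || !tax) { print "expected columns not found" > "/dev/stderr"; exit 1 }
      next
    }
    {
      n = split($tax, levels, ";")
      name = "unclassified"
      for (j = 1; j <= n; j++) {
        # an entry such as "g__" means there is no assignment at that rank
        if (index(levels[j], rank "__") == 1 && length(levels[j]) > length(rank) + 2)
          name = substr(levels[j], length(rank) + 3)
      }
      print $id "\t" name
    }
  ' "$1"
}

# Toy summary with two invented records.
toy=$(mktemp)
printf 'user_genome\tclassification\n' > "$toy"
printf 'bin.1\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n' >> "$toy"
printf 'bin.2\td__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__;s__\n' >> "$toy"
gtdb_rank "$toy" g
rm -f "$toy"
```

In the toy example, `bin.1` is reported as `Escherichia`, while `bin.2` has no genus assignment and is reported as `unclassified`.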

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>How complete are our genomes?</li>
  <li>What is the checkM output file telling us?</li>
  <li>What taxa are we dealing with?</li>
</ul>
</div>

## Genome annotation and metabolic potential
Genome annotation refers to the process of identifying coding features (genes, ORFs), non-coding features (rRNA, tRNA, siRNA, antisense RNA), and additional structural features (signal peptides, transmembrane helices, CRISPR arrays, etc.). Although available tools cover all of these aspects, manual curation is essential to obtain a high-quality genome annotation.

### Genome annotation
Obviously, we do not have the time for extensive manual curation. Instead, we will use [`prokka`](https://github.com/tseemann/prokka) as a tool for rapid prokaryotic genome annotation. `prokka` combines many different tools and database resources and can annotate a typical genome in a few minutes. `prokka` has a number of optional parameters, but we will call it in a very basic way.

<img src="img/prokka.png" alt="the prokka pipeline" width="500"/>

<font size="2"> Prokka pipeline, incl. tools and dbs. </font>

In [None]:
### Let's annotate the binned genomes
cd ~/data/MBGW223
#mkdir 03_GENOM_ANNOT
conda deactivate
conda activate genome_annot
for file in ./02_ASSEMBLY_BINNING/bin_refinement_spades_metawrap/metawrap_70_10_bins/*.fa; do
    filename=$(basename -- "$file")   # e.g. bin.10.fa
    prefix="${filename%.*}"           # strip the extension, e.g. bin.10
    prokka --outdir ./03_GENOM_ANNOT/"$prefix"_prokka --locustag "$prefix" "$file"
done

[18:32:34] This is prokka 1.14.6
[18:32:34] Written by Torsten Seemann <torsten.seemann@gmail.com>
[18:32:34] Homepage is https://github.com/tseemann/prokka
[18:32:34] Local time is Thu Aug 24 18:32:34 2023
[18:32:34] You are lu87neb
[18:32:34] Operating system is linux
[18:32:34] You have BioPerl 1.7.8
Argument "1.7.8" isn't numeric in numeric lt (<) at /data/mambaforge/envs/genome_annot/bin/prokka line 259.
[18:32:34] System has 48 cores.
[18:32:34] Will use maximum of 8 cores.
[18:32:34] Annotating as >>> Bacteria <<<
[18:32:34] Creating new output folder: ./03_GENOM_ANNOT/bin.10_prokka
[18:32:34] Running: mkdir -p \.\/03_GENOM_ANNOT\/bin\.10_prokka
[18:32:34] Using filename prefix: PROKKA_08242023.XXX
[18:32:34] Setting HMMER_NCPU=1
[18:32:34] Writing log to: ./03_GENOM_ANNOT/bin.10_prokka/PROKKA_08242023.log
[18:32:34] Command: /data/mambaforge/envs/genome_annot/bin/prokka --outdir ./03_GENOM_ANNOT/bin.10_prokka --locustag bin.10 ./02_ASSEMBLY_BINNING/bin_refinement_spades_metawra

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
Have a look at the prokka output folder to get an idea about what kind of output files are generated.
<ul>
  <li>What is the difference between .gff and .gbk files?</li>
  <li>Dive into the nucleic acid FASTA file (.ffn). Can you find any 16S rRNA genes?</li>
</ul>
</div>

### Probing potentially present 16S rRNA genes

If you found 16S rRNA genes, download them, browse to NCBI BLAST, select nucleotide BLAST, upload the sequences, and choose 16S ribosomal RNA sequences (Bacteria/Archaea) as the reference database. Besides NCBI's BLAST suite, you can also make use of dedicated resources for rRNA sequence analysis, for instance SILVA. Browse to SILVA, upload your sequences, tick “Search and Classify”, and run the aligner. Does the 16S rRNA gene-based taxonomy match the taxonomy determined by `GTDB-Tk`?
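Before uploading anything, the 16S rRNA gene sequences can be pulled out of an `.ffn` file on the command line. The sketch below assumes that the FASTA headers carry the product name after the locus tag and that 16S rRNA genes are labelled `16S ribosomal RNA`; check a few headers of your own files first. The function name `extract_16s` and the toy records are made up for this example.

```shell
# extract_16s [FILE.ffn]  (reads stdin if no file is given)
# Prints all FASTA records whose header mentions "16S ribosomal RNA".
extract_16s() {
  awk '
    /^>/ { keep = ($0 ~ /16S ribosomal RNA/) }   # decide once per record, on its header line
    keep                                         # print header and sequence lines of kept records
  ' "$@"
}

# Toy input with two invented records; only the second one is printed.
printf '>bin.1_00001 hypothetical protein\nATGAAATAA\n>bin.1_00002 16S ribosomal RNA\nAGAGTTTGATCCTGGCTCAG\n' | extract_16s
```

Because the decision is made per header line, multi-line sequences are kept or dropped as a whole.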

### Assessing the metabolic potential | KEGG pathway mapping

Let's (superficially) check the metabolic potential of our bins. In principle, we could go through our genome annotations and check which functions are encoded, but this is extremely laborious: if you are interested in genes linked to a particular pathway or function (e.g. autotrophy), you would have to check the annotation for every enzymatic function involved.

To bypass this issue, we take advantage of [`KEGG`](https://www.genome.jp/kegg/), the Kyoto Encyclopedia of Genes and Genomes. Among many other things, KEGG contains a sophisticated collection of pathway maps, which can be used for metabolic reconstructions. Each enzymatic function of a particular pathway has an assigned K-number. If you have a list of K-numbers, you can map them onto the KEGG pathway maps as a whole and check which pathways they are linked to.

This is exactly what we want to do now. In order to get lists of K-numbers for our bins we have to run database searches for all encoded amino acid sequences against the KEGG database.

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
<ul>
  <li>Move all .faa files from the prokka output folders into one folder.</li>
  <li>Concatenate them: <code>cat *.faa > bins_AAseqs.faa</code></li>
  <li>Download this concatenated file and upload it to <a href="https://www.kegg.jp/blastkoala/"><code>BlastKOALA</code></a></li>
</ul>
</div>
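One practical detail: as the log above shows, `prokka` names its output files after the run date (prefix `PROKKA_08242023`), so the `.faa` files of different bins can share the same file name and would overwrite each other in a common folder. A minimal sketch of one way around this is to copy each file under the name of its bin. The function name `collect_faa` and the toy folder layout are assumptions for illustration; the `<bin>_prokka` folder naming follows the annotation loop used above.

```shell
# collect_faa ANNOT_DIR OUT_DIR
# Copies every <bin>_prokka/*.faa to OUT_DIR/<bin>.faa and builds OUT_DIR/bins_AAseqs.faa.
collect_faa() {
  mkdir -p "$2"
  : > "$2/bins_AAseqs.faa"                        # start with an empty combined file
  for faa in "$1"/*_prokka/*.faa; do
    [ -e "$faa" ] || continue                     # skip if the pattern matched nothing
    bin=$(basename "$(dirname "$faa")" _prokka)   # e.g. bin.10_prokka -> bin.10
    cp "$faa" "$2/$bin.faa"
    cat "$faa" >> "$2/bins_AAseqs.faa"
  done
}

# Toy demonstration in a temporary directory (invented sequences).
demo=$(mktemp -d)
mkdir -p "$demo/annot/bin.1_prokka" "$demo/annot/bin.2_prokka"
printf '>bin.1_00001\nMKT\n' > "$demo/annot/bin.1_prokka/PROKKA_01012023.faa"
printf '>bin.2_00001\nMVL\n' > "$demo/annot/bin.2_prokka/PROKKA_01012023.faa"
collect_faa "$demo/annot" "$demo/faa_all"
ls "$demo/faa_all"
rm -rf "$demo"
```

In the course directory, the call would look like `collect_faa ./03_GENOM_ANNOT ./03_GENOM_ANNOT/faa_all`, where `faa_all` is a freely chosen folder name.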

![BlastKOALA mask](img/blastkoala.png)
<font size="2"> The BlastKOALA job submission mask. </font>

* Check your email, confirm the BlastKOALA job, and wait for the results
* Once the job is done you can download the results (a tab-separated table, gene ids \t K-number) and upload them to [`KEGG Mapper`](https://www.genome.jp/kegg/tool/map_pathway.html).
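Because all bins were submitted as one concatenated file, the result table mixes genes from every bin. To look at the bins one by one, the table can be split again by locus tag. This sketch assumes gene IDs of the form `<bin>_<number>` (which follows from passing the bin name as `--locustag` to `prokka`) and a two-column, tab-separated table. The function name `split_ko_by_bin` and the `_ko.txt` suffix are arbitrary choices for illustration, and the K-numbers in the toy table are placeholders.

```shell
# split_ko_by_bin KO_TABLE OUT_DIR
# Writes one file per bin (OUT_DIR/<bin>_ko.txt) containing that bin's lines.
split_ko_by_bin() {
  mkdir -p "$2"
  awk -F'\t' -v outdir="$2" '
    {
      bin = $1
      sub(/_[0-9]+$/, "", bin)              # bin.10_00042 -> bin.10
      print > (outdir "/" bin "_ko.txt")    # lines without a K-number are kept as they are
    }
  ' "$1"
}

# Toy table: gene id <TAB> K-number (second column empty if nothing was assigned).
demo=$(mktemp -d)
printf 'bin.1_00001\tK00001\nbin.1_00002\t\nbin.2_00001\tK00002\n' > "$demo/user_ko.txt"
split_ko_by_bin "$demo/user_ko.txt" "$demo/per_bin"
wc -l "$demo"/per_bin/*_ko.txt
rm -rf "$demo"
```

Each per-bin file can then be uploaded separately, which makes it easier to compare bins.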

![KEGG Mapper mask](img/keggmapper.png)

<font size="2"> The KEGG Mapper job submission mask. </font>
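In addition to browsing the pathway maps, a short list of K-numbers of interest can be checked against a result table on the command line. The sketch below assumes a plain text file with one required K-number per line and a two-column table (gene id, tab, K-number). The function name `missing_kos` is made up, and the K-numbers in the toy input are placeholders that do not describe a real pathway; take the actual K-numbers from the respective KEGG pathway or module page.

```shell
# missing_kos REQUIRED_LIST KO_TABLE
# Prints every required K-number that does not occur in the table.
missing_kos() {
  awk -F'\t' '
    FILENAME == ARGV[1] { if ($2 != "") have[$2] = 1; next }   # KO table: remember assigned K-numbers
    !($1 in have)                                              # required list: print what is absent
  ' "$2" "$1"
}

# Toy input with placeholder K-numbers.
demo=$(mktemp -d)
printf 'K00001\nK00002\nK00003\n' > "$demo/required.txt"
printf 'bin.1_00001\tK00001\nbin.1_00002\t\nbin.1_00003\tK00003\n' > "$demo/bin.1_ko.txt"
missing_kos "$demo/required.txt" "$demo/bin.1_ko.txt"
# prints: K00002
rm -rf "$demo"
```

An empty output means that all listed K-numbers were found, which is a hint rather than proof that the pathway is functional.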

<div class="alert alert-block alert-warning">
<b>QUESTION/TASK:</b> 
Go through all bins and take a look at selected metabolic aspects:
<ul>
  <li>Do all genome bins encode the complete set of genes for glycolysis? » Check “Carbohydrate metabolism”</li>
  <li>Do some bins have a broader substrate utilization potential? » Check “Carbohydrate metabolism”</li>
  <li>Do all bins encode peptidoglycan biosynthesis? » Check “Glycan biosynthesis”</li>
  <li>Do you find pathways of biogeochemical relevance? » Check “Methane/Nitrogen metabolism”</li>
  <li>Is there any bin with an extended potential for secondary metabolite biosynthesis? » Check "Type I polyketide structures"</li>
  <li>Does one of the bins feature eukaryotic signature proteins? » Check "Protein export" &amp; "Viral life cycle - HIV1"</li>
</ul>
</div>

<sub> © Carl-Eric Wegner, 2023-08 </sub>