# 3. Annotation

Author: Sandra Godinho Silva \
Latest version: 0.2 (07/09/2020)

In [534]:
#import libraries
import pandas as pd
import numpy as np
import seaborn as sns

In [535]:
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 1000)

# 1 Prokka

Prokka: https://github.com/tseemann/prokka

**1.1 Run script that formats .fasta files by shortening contig headers: contig_nammer.py** \
Utility: FASTA files downloaded from NCBI sometimes have long contig headers, which makes them unsuitable as Prokka input. Running this script on the FASTA files shortens the headers and solves the problem.
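
The header shortening described above can be sketched as follows. This is a minimal illustration, not the actual `contig_nammer.py`: the function name `shorten_fasta_headers`, the `contig_N` naming scheme and the returned mapping are all assumptions made for the example.

```python
def shorten_fasta_headers(src, dst, prefix="contig"):
    """Rewrite every FASTA header in `src` as >prefix_N and write the result to `dst`.

    Hypothetical helper: the naming scheme is an assumption, not the author's.
    Returns a dict mapping each new short ID to the original header text,
    so the original contig names can be recovered later.
    """
    mapping = {}
    count = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                count += 1
                new_id = f"{prefix}_{count}"
                mapping[new_id] = line[1:].strip()
                fout.write(f">{new_id}\n")
            else:
                # Sequence lines are copied unchanged.
                fout.write(line)
    return mapping
```

Keeping the mapping makes it possible to trace a short ID back to the NCBI accession after annotation.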

**1.2 File division into subdirectories** \
Divide the genomes into folders of 5 genomes each before job submission, to speed up the jobs on the EVE cluster. \
Input: folder with the genomes in FASTA format.
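
One way to do this split is sketched below. The function name `split_into_batches`, the `batch_001`-style folder names and the `*.fasta` pattern are assumptions made for the example; files are copied rather than moved so the input folder stays intact.

```python
import shutil
from pathlib import Path


def split_into_batches(genome_dir, out_dir, batch_size=5, pattern="*.fasta"):
    """Copy genomes from `genome_dir` into numbered subfolders of `out_dir`.

    Hypothetical helper: folder names and the glob pattern are assumptions.
    Each subfolder receives at most `batch_size` genomes.
    Returns the list of created batch folders.
    """
    genomes = sorted(Path(genome_dir).glob(pattern))
    batches = []
    for start in range(0, len(genomes), batch_size):
        batch_dir = Path(out_dir) / f"batch_{start // batch_size + 1:03d}"
        batch_dir.mkdir(parents=True, exist_ok=True)
        for genome in genomes[start:start + batch_size]:
            shutil.copy2(genome, batch_dir / genome.name)
        batches.append(batch_dir)
    return batches
```

The last folder simply holds the remainder when the number of genomes is not a multiple of 5.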

**1.3 For loop submission on EVE:** \
Input: folder containing all the folders created in the last step. Each folder contains 5 genomes.

In [None]:
find /data/msb/silva/NEW/3_Annotation/Prokka -type f -name '*.gbk' -print0 | xargs -0 -I % cp % .

In [None]:
find . -name \*.gbk -exec cp {} /data/msb/silva/Prokka_ano/prokka_out/Gbk/ \;

# 2 Pfam

**2.1 To annotate the genomes into Pfams, a local database is created.** \
Download the latest Pfam-A.hmm:

**2.2 Run hmmsearch** \
Running hmmsearch requires HMMER to be installed. Creating a conda environment is the easiest way to install it.
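
As an illustration of this step, the sketch below builds an `hmmsearch` command line for one Prokka protein file. The helper name `build_hmmsearch_command`, the output folder, the CPU count and the use of the gathering thresholds (`--cut_ga`) are assumptions; the thresholds actually used in this pipeline are not stated here.

```python
from pathlib import Path


def build_hmmsearch_command(faa, hmm_db="Pfam-A.hmm", out_dir="pfam_out", cpus=4):
    """Return the hmmsearch argument list for one protein FASTA file.

    Hypothetical helper: paths, CPU count and --cut_ga are assumptions.
    The tabular per-sequence output goes to <out_dir>/<genome>.tblout.
    """
    faa = Path(faa)
    tblout = Path(out_dir) / f"{faa.stem}.tblout"
    return [
        "hmmsearch",
        "--cpu", str(cpus),
        "--cut_ga",               # use the Pfam gathering thresholds (assumption)
        "--tblout", str(tblout),  # per-sequence hits in tabular form
        "-o", "/dev/null",        # discard the verbose human-readable report
        str(hmm_db),
        str(faa),
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` inside the activated conda environment, once per `.faa` file.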

**2.3 Convert tblout files**

In [None]:
python /data/msb/silva/NEW/3_Annotation/tblout2.py .
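
For reference, a `--tblout` file can be read into pandas roughly as follows. This is a sketch and not the content of `tblout2.py`; the column names are chosen for the example and follow the order of the 18 whitespace-separated fields plus the free-text description that hmmsearch writes.

```python
import pandas as pd

# Column names are assumptions; the order follows the hmmsearch --tblout layout.
TBLOUT_COLUMNS = [
    "target_name", "target_accession", "query_name", "query_accession",
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias",
    "exp", "reg", "clu", "ov", "env", "dom", "rep", "inc",
    "description",
]
NUMERIC_COLUMNS = [
    "full_evalue", "full_score", "full_bias",
    "dom_evalue", "dom_score", "dom_bias", "exp",
]


def read_tblout(path):
    """Parse an hmmsearch --tblout file into a DataFrame, one row per hit."""
    rows = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comment and blank lines
            # Split on whitespace, but keep the description (last field) intact.
            fields = line.rstrip("\n").split(maxsplit=len(TBLOUT_COLUMNS) - 1)
            fields += [""] * (len(TBLOUT_COLUMNS) - len(fields))
            rows.append(fields)
    table = pd.DataFrame(rows, columns=TBLOUT_COLUMNS)
    table[NUMERIC_COLUMNS] = table[NUMERIC_COLUMNS].astype(float)
    return table
```

With hmmsearch, `target_name` is the protein (ORF) and `query_accession` is the Pfam accession, which is the pair needed for downstream count tables.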

# 3 COG
Adapted from https://github.com/aleimba/bac-genomics-scripts/tree/master/cdd2cog \
e value: 0.001 (1e-5) 

For now cite the latest major release (tag: bovine_ecoli_mastitis) hosted on Zenodo:

Leimbach A. 2016. bac-genomics-scripts: Bovine E. coli mastitis comparative genomics edition. Zenodo. http://dx.doi.org/10.5281/zenodo.215824.

**3.1 Installation:**

**3.2 In the folder with the .faa files from Prokka: rpsblast** 

**3.3 In the folder with the output files from rpsblast: Prokkacdd2cog**

# 4 CAZymes
To annotate the CAZymes in the genomes, the following script was used: https://github.com/linnabrown/run_dbcan \
**4.1 In the folder with the .faa files from Prokka: run_dbcan**


# 5 KEGG

To get the KEGG annotation, a script that converts the Prokka annotation into KOs is used: https://github.com/SilentGene/Bio-py/tree/master/prokka2kegg \
KO entries (K numbers in the KEGG annotation) are assigned *in batch mode* according to the UniProtKB IDs in the `Prokka` `*.gbk` files.
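
The idea behind this mapping can be sketched as follows. This is a simplified illustration and not the code of `prokka2kegg.py`: it assumes a three-column, tab-separated cross-reference table (UniProtKB accession, ID type, ID) in which KO entries carry the ID type `KO`, and it assumes the UniProtKB reference is not wrapped across two lines in the `.gbk` file. The function names are hypothetical.

```python
import re

# Matches e.g. "UniProtKB:P0A6F5" in a Prokka /inference qualifier (assumed format).
UNIPROT_PATTERN = re.compile(r"UniProtKB:([A-Z0-9]+)")


def load_ko_map(idmapping_lines):
    """Build {UniProtKB accession: [KO, ...]} from tab-separated mapping lines."""
    ko_map = {}
    for line in idmapping_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[1] == "KO":
            ko_map.setdefault(parts[0], []).append(parts[2])
    return ko_map


def assign_kos(gbk_text, ko_map):
    """Return {UniProtKB accession: [KO, ...]} for accessions found in the gbk text."""
    hits = {}
    for uniprot_id in UNIPROT_PATTERN.findall(gbk_text):
        if uniprot_id in ko_map:
            hits[uniprot_id] = ko_map[uniprot_id]
    return hits
```

Proteins that Prokka annotated without a UniProtKB reference, or whose accession has no KO cross-reference, stay unassigned in this scheme.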

**5.1 Prepare the cross-reference database provided by UniProt**

**5.2 In the folder with the .gbk files from Prokka: prokka2kegg.py**

# 6 Create count tables and merge annotations

#### Run **orf_annotation.py**
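
The kind of merging and counting this step performs can be sketched with pandas as below. This is not the content of `orf_annotation.py`; the column names (`orf`, `genome`, `pfam`, `ko`) and both helper functions are assumptions made for the example.

```python
import pandas as pd


def merge_annotations(tables, key="orf"):
    """Outer-merge per-ORF annotation tables on a shared ORF identifier.

    Hypothetical helper: ORFs missing from one annotation source are kept
    and get NaN in that source's columns.
    """
    merged = None
    for table in tables:
        merged = table if merged is None else merged.merge(table, on=key, how="outer")
    return merged


def count_table(merged, genome_col, annotation_col):
    """Count how often each annotation occurs per genome (genomes as rows)."""
    return pd.crosstab(merged[genome_col], merged[annotation_col])
```

A short usage example with made-up identifiers:

```python
pfam = pd.DataFrame({
    "orf": ["g1_1", "g1_2", "g2_1"],
    "genome": ["g1", "g1", "g2"],
    "pfam": ["PF1", "PF1", "PF2"],
})
kegg = pd.DataFrame({"orf": ["g1_1"], "ko": ["K1"]})

merged = merge_annotations([pfam, kegg])
pfam_counts = count_table(merged, "genome", "pfam")
```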

# 7 antiSMASH

**7.1 Format .gbk file headers to use them as input in antiSMASH: gbk_file_formater.py** \
Utility: Output files from Prokka annotation can have some incompatibilities with the required antiSMASH input.

**7.2 Create output directories:**

**7.3 Run antiSMASH** \
Options chosen: 

# 8 BiG-SCAPE

---

# References

**Prokka** \
Seemann, T. “Prokka: Rapid Prokaryotic Genome Annotation.” Bioinformatics 30, no. 14
(July 15, 2014): 2068–69. https://doi.org/10.1093/bioinformatics/btu153 .

**antiSMASH** \
Blin, K., Shaw, S., Steinke, K., Villebro, R., Ziemert, N., Lee, S. Y., Medema, M. H., and Weber, T. “antiSMASH 5.0: Updates to the Secondary Metabolite Genome Mining Pipeline.” Nucleic Acids Research (2019). https://doi.org/10.1093/nar/gkz310 .