## Table of Contents
- <a href='#1.0'>Section 1 - Introduction and Recap</a>
- <a href='#2.0'>Section 2 - Intro to Annotation</a>
    - <a href='#2.1'>Section 2.1 - What is Annotation?</a>
- <a href='#3.0'>Section 3 - Types of Annotation</a>
    - <a href='#3.1'>Section 3.1 - Structural Annotation</a>
    - <a href='#3.2'>Section 3.2 - Functional Annotation</a>

# Tutorial 3 - Genome Annotation

## <a id='1.0'>Section 1 - Introduction and Recap</a>

In the last tutorial we touched on sequencing and how to perform genome assemblies, including the computational aspects of assembly. Afterwards, we ran a BLAST search on our assembled contigs to identify which strain of E. coli they belong to, and found that the contigs belonged to a strain called K12.

In this tutorial we're going to try to describe the functions encoded in the K12 genome through annotation.

## <a id='2.0'>Section 2 - Intro to Annotation</a>

### <a id='2.1'>Section 2.1 - What is Annotation?</a>

Annotation refers to the process of adding supplemental information to a sequence (nucleotide or protein) so that we can better understand its underlying function or purpose. This isn't an easy task because of the tremendous number of processes needed to keep an organism alive. There's also the problem of mutations: a mutation may or may not change a sequence's function, so the algorithms we use need to be robust enough to take mutations into consideration.
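
As a small illustration of why a mutation may or may not change function, the sketch below translates three versions of a short sequence. The sequences are made up for this example, and the codon table is deliberately partial: it only covers the codons used here.

```python
# Partial codon table: only the codons needed for this example.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GAA": "E",  # glutamate
    "GAG": "E",  # glutamate (same amino acid as GAA)
    "GAT": "D",  # aspartate
    "TAA": "*",  # stop
}


def translate(dna):
    """Translate a DNA string codon by codon using CODON_TABLE."""
    usable = len(dna) - len(dna) % 3  # ignore any trailing partial codon
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical sequences, not taken from the K12 genome.
original = "ATGGAATAA"
silent = "ATGGAGTAA"    # third base of the second codon: A -> G
missense = "ATGGATTAA"  # third base of the second codon: A -> T

for name, seq in [("original", original), ("silent", silent), ("missense", missense)]:
    print(name, seq, translate(seq))
```

The first mutation leaves the protein unchanged, while the second swaps one amino acid for another. Whether that swap actually alters function is exactly the kind of question annotation tools have to handle.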

To help annotation programs account for these mutations, databases of protein families such as <a href='https://pfam.xfam.org/'>Pfam</a> are maintained. Many annotation programs search our sequences against statistical models called hidden Markov models (HMMs), which are also maintained on <a href='https://pfam.xfam.org/'>Pfam</a>. We won't go over HMMs in this tutorial, but you should know that they're an important method for annotation.

## <a id='3.0'>Section 3 - Types of Annotation</a>

### <a id='3.1'> Section 3.1 - Structural Annotation</a>

Structural annotation refers to locating all of the coding regions in your sequence and identifying the genes in those regions. In this tutorial, we'll be using PROKKA, one of the most popular programs for prokaryotic genome annotation.

PROKKA is a "pipeline": a series of other programs that are strung together to process our data.

<img src='img/tut03/prokka.jpg'></img>
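
To make "locating coding regions" a bit more concrete, here is a toy open reading frame (ORF) finder. It is only a sketch and not how PROKKA works internally: it scans the forward strand only, accepts `ATG` as the only start codon, and does no statistical scoring. The function name `find_orfs` and the `min_codons` parameter are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive. Toy version: forward
    strand only, ATG starts only, no scoring of candidate genes.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF only if it is long enough (stop codon included).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Hypothetical sequence for illustration.
print(find_orfs("CCATGAAATTTTAGGG"))
```

Real gene predictors also look at the reverse strand, alternative start codons, and sequence statistics, which is why we leave this step to a dedicated pipeline.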

We're going to use PROKKA to annotate our `K12.fasta` file and save the results in our `data/tut03/K12_annot` folder.

In [None]:
!prokka --setupdb

In [None]:
!prokka K12.fasta --outdir data/tut03/K12_annot
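
Once the run finishes, a quick sanity check is to count how many features of each type were annotated. The sketch below assumes the output folder contains a tab-separated `.tsv` summary with a header row that includes a column named `ftype`; column names may differ between PROKKA versions, so check the first line of your own file. The function name `count_feature_types` is made up for this example.

```python
import csv
import glob
from collections import Counter


def count_feature_types(tsv_path):
    """Count rows per feature type (e.g. CDS, tRNA) in a tab-separated file.

    Assumes the file has a header row with a column named 'ftype'.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["ftype"] for row in reader)


# Prints nothing if the annotation folder does not exist yet.
for path in glob.glob("data/tut03/K12_annot/*.tsv"):
    print(path, dict(count_feature_types(path)))
```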

Now that we have annotated the genes in `K12.fasta`, we can extract the gene names that were found using the command below and save them as `genelist.txt`.

(FYI: to run this command in Jupyter you need to open up a terminal window.)

In [None]:
!cut -f4 data/tut03/K12_annot/*.tsv | sort -u > genelist.txt
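
If you would rather stay inside the notebook, the same extraction can be sketched in Python. This version assumes the `.tsv` file has a header row with a column named `gene` (check your own file, since the layout can vary between PROKKA versions). Unlike the shell command, it skips the header and any empty entries. The function names are made up for this example.

```python
import csv


def read_gene_names(tsv_path, column="gene"):
    """Return the sorted, unique, non-empty values of one TSV column."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        # row.get() is falsy for empty or missing entries, so those are skipped.
        return sorted({row[column] for row in reader if row.get(column)})


def write_gene_list(genes, out_path="genelist.txt"):
    """Write one gene name per line."""
    with open(out_path, "w") as handle:
        handle.write("\n".join(genes) + "\n")
```

A typical call would pass the path of the `.tsv` file in `data/tut03/K12_annot` to `read_gene_names` and hand the result to `write_gene_list`.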

### <a id='3.2'> Section 3.2 - Functional Annotation</a>

Functional annotation allows us to gauge the function of the proteins encoded in our sequence. This is very useful if you want to identify any unique behaviours in your sample, for example whether your E. coli sample is pathogenic. It can also give you a summary of how your bacteria behave in a sample.

To do this we're going to use the website of the ever-popular <a href='http://www.geneontology.org/'>Gene Ontology Consortium</a>.

We need to download our `genelist.txt` file and copy the genes into the website.
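
Before pasting, it can help to tidy the list. The sketch below drops blank lines and a stray `gene` header line, and strips numeric suffixes such as `_1` or `_2`. It assumes that repeated gene names in your `genelist.txt` carry such suffixes, which is worth checking against your own file. The function name `clean_gene_names` is made up for this example.

```python
import re


def clean_gene_names(lines):
    """Tidy a list of gene names for pasting into a web form.

    Drops blank lines and a 'gene' header line, removes a trailing
    '_<number>' suffix, and returns sorted unique names.
    """
    cleaned = set()
    for line in lines:
        name = line.strip()
        if not name or name == "gene":
            continue
        cleaned.add(re.sub(r"_\d+$", "", name))
    return sorted(cleaned)


# Hypothetical file contents for illustration.
example = ["gene\n", "thrA\n", "\n", "yjhU_1\n", "yjhU_2\n"]
print("\n".join(clean_gene_names(example)))
```

To use it on the real file, open `genelist.txt` and pass the file handle to `clean_gene_names` in place of the example list.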

For the next tutorial, we'll take all of our annotated files and perform some comparative genomics!