<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 01 - Annotation

## Table of Contents

1. [Introduction](#introduction)

<a id="introduction"></a>
## Introduction

<div class="alert-success">
<b>We'll go from UniProt to the original genomes on the NCBI.</b>
</div>

We're starting from two forms of a lipase protein available in the UniProt database:

* WT: http://www.uniprot.org/uniprot/C2LFD0 from organism *Proteus mirabilis* ATCC 29906
* Engineered: http://www.uniprot.org/uniprot/B4EVM3 from organism *Proteus mirabilis* (strain HI4320)

UniProt is by design protein-centric, but we'd now like to consider the genomic context of the genes encoding these proteins. Each UniProt entry has a cross-references section, which includes links to *Sequence databases* where we can find the NCBI Reference Sequence (RefSeq) database entries for the source genomes.

* WT: http://www.uniprot.org/uniprot/C2LFD0 matches NCBI Reference Sequence protein [WP_004242549.1](https://www.ncbi.nlm.nih.gov/protein/WP_004242549.1) from [NZ_GG668576.1](https://www.ncbi.nlm.nih.gov/nuccore/NZ_GG668576.1) *Proteus mirabilis* ATCC 29906 SCAFFOLD1, whole genome shotgun sequence
* Engineered: http://www.uniprot.org/uniprot/B4EVM3 matches multi-species protein [WP_004247922.1](https://www.ncbi.nlm.nih.gov/protein/WP_004247922.1), but UniProt specifically links to [NC_010554.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_010554.1) *Proteus mirabilis* strain HI4320, complete genome

<a id="the-genbank-format"></a>
## The GenBank format

<div class="alert-success">
<b>We'll introduce the GenBank file format used by the NCBI to share annotated genomes.</b>
</div>

Starting with our wild-type protein, clicking on the genome accession [NZ_GG668576.1](https://www.ncbi.nlm.nih.gov/nuccore/NZ_GG668576.1), we can see the whole genome shown as a webpage with clickable links based on the plain text GenBank format.

GenBank format is intended to be human readable, and is a widely used file format for annotated genomes. The related EMBL file format, used by the European sequence database which mirrors the NCBI sequence database, shares the same INSDC feature table design.

With bacterial genomes, for each annotated gene you expect to see a pair of features: a ``gene`` immediately followed by a ``CDS`` entry for the coding sequence, using the same location coordinates. With eukaryotes there is usually an ``mRNA`` entry as well, and the CDS location is more complicated because it describes only the exons which make up the coding sequence.

Within [NZ_GG668576.1](https://www.ncbi.nlm.nih.gov/nuccore/NZ_GG668576.1) using the browser's search we can find the NCBI's protein identifier ``WP_004242549.1`` to look at the gene annotation:

```
     gene            46146..47009
                     /locus_tag="HMPREF0693_RS00250"
                     /old_locus_tag="HMPREF0693_0570"
     CDS             46146..47009
                     /locus_tag="HMPREF0693_RS00250"
                     /old_locus_tag="HMPREF0693_0570"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004242549.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="lipase"
                     /protein_id="WP_004242549.1"
```

Likewise for our engineered protein, click on its genome accession [NC_010554.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_010554.1) and find the matching NCBI protein identifier ``WP_004247922.1``:

```
     gene            1063081..1063944
                     /locus_tag="PMI_RS04850"
                     /old_locus_tag="PMI0999"
                     /db_xref="GeneID:6803666"
     CDS             1063081..1063944
                     /locus_tag="PMI_RS04850"
                     /old_locus_tag="PMI0999"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:WP_004242549.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="lipase"
                     /protein_id="WP_004247922.1"
                     /db_xref="GeneID:6803666"
```
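The paired ``gene``/``CDS`` layout seen in these excerpts can also be checked programmatically. The following is a minimal sketch using only the Python standard library, assuming simple ``start..end`` locations and quoted qualifiers; the function names ``parse_feature_table`` and ``gene_cds_pairs`` are made up for this illustration, and for real work a full parser such as Biopython's ``Bio.SeqIO`` would be the better choice.

```python
# Wild-type excerpt copied from the GenBank record shown above
EXCERPT = """
     gene            46146..47009
                     /locus_tag="HMPREF0693_RS00250"
                     /old_locus_tag="HMPREF0693_0570"
     CDS             46146..47009
                     /locus_tag="HMPREF0693_RS00250"
                     /old_locus_tag="HMPREF0693_0570"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004242549.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="lipase"
                     /protein_id="WP_004242549.1"
"""


def parse_feature_table(text):
    """Parse feature-table lines into a list of dictionaries.

    Simplified sketch: feature keys are recognised by their shallow
    indentation, qualifiers by a leading "/", and anything else is
    treated as the continuation of the previous qualifier.
    """
    features = []
    current = None    # the feature currently being filled in
    qualifier = None  # name of the qualifier a continuation line belongs to
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        indent = len(line) - len(line.lstrip())
        if indent < 10:
            # Feature line, e.g. "     gene            46146..47009"
            key, location = stripped.split(None, 1)
            current = {"type": key, "location": location, "qualifiers": {}}
            features.append(current)
            qualifier = None
        elif current is not None and stripped.startswith("/"):
            # Qualifier line, e.g. /product="lipase" or /codon_start=1
            name, _, raw = stripped[1:].partition("=")
            current["qualifiers"][name] = raw.strip('"')
            qualifier = name
        elif current is not None and qualifier is not None:
            # Continuation of a multi-line qualifier value
            joined = current["qualifiers"][qualifier] + " " + stripped
            current["qualifiers"][qualifier] = joined.strip('"')
    return features


def gene_cds_pairs(features):
    """Return locus tags where a gene is immediately followed by a CDS
    with exactly the same location."""
    pairs = []
    for first, second in zip(features, features[1:]):
        if (first["type"] == "gene" and second["type"] == "CDS"
                and first["location"] == second["location"]):
            pairs.append(first["qualifiers"].get("locus_tag"))
    return pairs


features = parse_feature_table(EXCERPT)
for feature in features:
    print(feature["type"], feature["location"],
          feature["qualifiers"].get("protein_id", "-"))
print("Paired gene/CDS loci:", gene_cds_pairs(features))
```

The sketch deliberately ignores compound locations such as ``join(...)`` or ``complement(...)``, so it would need extending before being applied to eukaryotic annotation.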

We're going to work more with these GenBank files, so let's download them.

## Downloading GenBank files

You *could* download the two GenBank files we want via the NCBI website, but it takes a lot of manual steps. Currently you would start with the "send" menu at the top right of the screen, picking "Complete Record", destination "File", and format "GenBank (full)". Save the two files as ``NZ_GG668576.gbk`` and ``NC_010554.gbk`` respectively.

![NCBI menu to send a record to file in GenBank format](./01-sequences/images/ncbi_send_to_file_genbank_full.png)

If you accidentally use format "GenBank", then with these examples you'll get a much smaller file missing all the features and sequence data. Also, depending on your browser's settings, the file may be saved with a default name like ``sequence.gb.txt`` in your downloads folder, which means you'd have to move the file to the working folder and rename it.

Thankfully the NCBI provides a way to automate this, called the [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/). These can be used from many different programming languages, but we will use them from Python via the Biopython wrapper ``Bio.Entrez``.

```python
# Biopython's module to access the NCBI Entrez Programming Utilities
from Bio import Entrez

# The NCBI likes to know who is using their services in case of problems,
# so replace this placeholder with your own email address.
Entrez.email = "your.name.here@example.org"

for accession in ("NZ_GG668576", "NC_010554"):
    print("Fetching %s from NCBI..." % accession)
    # Return type "gbwithparts" matches "GenBank (full)" on the website
    fetch_handle = Entrez.efetch(
        db="nuccore", id=accession, rettype="gbwithparts", retmode="text"
    )
    with open(accession + ".gbk", "w") as output_handle:
        output_handle.write(fetch_handle.read())
    # Close the network handle once the file has been written
    fetch_handle.close()
    print("Saved %s.gbk" % accession)
```

If you open ``NZ_GG668576.gbk`` and ``NC_010554.gbk`` in a text editor, or simply view them at the command line using ``more``, they should look like the NCBI website's content.
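Since the Entrez Programming Utilities are a web service, each download corresponds to an HTTP request to the ``efetch`` endpoint. As a rough sketch of what such a request looks like, the code below builds the URL using only the Python standard library, without contacting the NCBI. The helper name ``build_efetch_url`` is made up for this illustration, and the exact set of parameters Biopython sends may differ (this sketch passes only the database, accession, return type, return mode and email).

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities efetch service
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, email):
    """Build an efetch URL for a full GenBank record (hypothetical helper)."""
    params = {
        "db": "nuccore",           # nucleotide database
        "id": accession,           # e.g. "NZ_GG668576"
        "rettype": "gbwithparts",  # "GenBank (full)" on the website
        "retmode": "text",         # plain text rather than XML
        "email": email,            # lets the NCBI contact you about problems
    }
    return EFETCH_URL + "?" + urlencode(params)


for accession in ("NZ_GG668576", "NC_010554"):
    print(build_efetch_url(accession, "your.name.here@example.org"))
```

Pasting one of the printed URLs into a browser should return the same plain text record. In practice, sticking with ``Bio.Entrez`` is more convenient, since it takes care of building and sending the request for you.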

### Genomic context

You can look directly at the GenBank file to see what genes or other features are annotated nearby, and deduce whether these genes are part of a larger gene locus. However, this is much easier to see graphically.
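For a quick textual check before turning to a graphical view, nearby features can be listed by comparing coordinates. The sketch below assumes the features have already been reduced to ``(start, end, label)`` tuples; the function name ``features_near`` and the neighbouring entries are invented for illustration, and only the lipase coordinates come from the wild-type excerpt shown earlier.

```python
def features_near(features, target_start, target_end, window=5000):
    """Return features overlapping the target region extended by
    `window` base pairs on each side (the target itself included)."""
    lower = max(1, target_start - window)
    upper = target_end + window
    return [f for f in features if f[0] <= upper and f[1] >= lower]


# Toy data: only the lipase entry uses real coordinates (46146..47009);
# the other entries are invented placeholders, not real annotation.
toy_features = [
    (40100, 41000, "toy_gene_A"),
    (44500, 45900, "toy_gene_B"),
    (46146, 47009, "lipase"),
    (47200, 48100, "toy_gene_C"),
    (60000, 61000, "toy_gene_D"),
]

for start, end, label in features_near(toy_features, 46146, 47009, window=2000):
    print("%s\t%i..%i" % (label, start, end))
```

With real data, the tuples would come from parsing the downloaded GenBank file rather than from a hand-written list.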

You might have noticed on the NCBI website display that in the second case (``WP_004247922.1`` for our engineered protein) these features have a clickable NCBI GeneID, [GeneID:6803666](https://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=6803666), which will show you the gene on the *Proteus mirabilis* strain HI4320 genome visually.

![Genomic context for GeneID:6803666](./images/genome_context_GeneID_6803666.png)

Frustratingly, for our wild-type protein there does not seem to be a matching NCBI gene database entry, so we need another way to visualise the genomic context.

For this we will open the downloaded GenBank plain text files in the Sanger Institute's tool Artemis, which is able to view and edit annotation files.

### Resources

* [UniProt webpage](http://www.uniprot.org/uniprot/)
* [NCBI page about RefSeq non-redundant proteins](https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/)
* [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/)
 