<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 01 - Introduction to sequence data and bioinformatics

## Table of Contents

1. [Introduction](#introduction)
2. [FASTA format](#fasta)
3. [GenBank format](#genbank_format)
4. [Downloading GenBank files](#downloading_gb)

<a id="introduction"></a>
## Introduction

<div class="alert-success">
<b>We've come up with a little example to motivate the specific sample data we will be using.</b>
</div>

In the course of this workshop we're going to be looking at two forms of a lipase protein from the bacterium *Proteus mirabilis*: the natural wild-type enzyme and an engineered form.

To prepare for this, we first need to introduce some widely used file formats for storing nucleotide and amino acid sequences, or entire genome sequences.

<a id="fasta"></a>
## FASTA format

<p></p>
<div class="alert-success">
<b>What is FASTA format?</b>
</div>

The FASTA format (named after an early bioinformatics tool of the same name) uses a special ``>`` marker line to indicate the start of each sequence. This ``>`` header line should begin with an identifier, optionally followed by a space and a description (all on one line). The subsequent lines (until the next ``>`` marker) are the associated sequence data, usually line wrapped (but the line wrapping has no meaning).
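The rules above are simple enough to sketch in a few lines of Python. This is only a minimal illustration: the function names ``parse_fasta`` and ``_make_record`` and the example records are made up for this sketch, and in practice you would normally use an existing library (for example Biopython's ``Bio.SeqIO``) rather than writing your own parser.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header line ends the previous record (if there was one)
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header, chunks):
    # The identifier is the first word; the rest (if any) is the description
    identifier, _, description = header.partition(" ")
    # Line wrapping has no meaning, so the sequence lines are simply joined
    return identifier, description, "".join(chunks)


# Two made-up records, the first wrapped over two lines
example = """>seq1 a made-up example
ATGAGC
ACCAAG
>seq2
TTGACA
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, repr(description), sequence, len(sequence))
```

Note that the wrapped lines of ``seq1`` end up as one continuous sequence, and that ``seq2`` has an identifier but an empty description.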

In a new terminal window, please change to this data directory using:

``` bash
$ cd ~/Teaching-IBioIC-Intro-to-Bioinformatics/01-introduction/data
```

If you list the ``*.fasta`` files, you should see:

``` bash
$ ls *.fasta
engineered_nt.fasta     glycoside_hydrolases_nt.fasta wildtype_nt.fasta
```

The ``wildtype_nt.fasta`` file should look like this using the ``more`` command:

``` bash
$ more wildtype_nt.fasta
>wildtype lipase protein from Proteus mirabilis
ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAACGAGATCGTG
GGCTTCCCCTACTTCTACGGCATCGCCGACGCCCTGAGGCAGGACGGCCACCAGGTGTTC
ACCGCCAGCCTGAGCGCCTTCAACAGCAACGAGGTGAGGGGCAAGCAGCTGTGGCAGTTC
GTGCAGACCCTGCTGCAGGAGACCCAGGCCAAGAAGGTGAACTTCATCGGCCACAGCCAG
GGCCCCCTGGCCTGCAGGTACGTGGCCGCCAACTACCCCGACAGCGTGGCCAGCGTGACC
AGCATCAACGGCGTGAACCACGGCAGCGAGATCGCCGACCTGTACAGGAGGATCATGAGG
AAGGACAGCATCCCCGAGTACATCGTGGAGAAGGTGCTGAACGCCTTCGGCACCATCATC
AGCACCTTCAGCGGCCACAGGGGCGACCCCCAGGACGCCATCGCCGCCCTGGAGAGCCTG
ACCACCGAGCAGGTGACCGAGTTCAACAACAAGTACCCCCAGGCCCTGCCCAAGACCCCC
GGCGGCGAGGGCGACGAGATCGTGAACGGCGTGCACTACTACTGCTTCGGCAGCTACATC
CAGGGCCTGATCGCCGGCGAGAAGGGCAACCTGCTGGACCCCACCCACGCCGCCATGAGG
GTGCTGAACACCTTCTTCACCGAGAAGCAGAACGACGGCCTGGTGGGCAGGAGCAGCATG
AGGCTGGGCAAGCTGATCAAGGACGACTACGCCCAGGACCACATCGACATGGTGAACCAG
GTGGCCGGCCTGGTGGGCTACAACGAGGACATCGTGGCCATCTACACCCAGCACGCCAAG
TACCTGGCCAGCAAGCAGCTG
```

The ``engineered_nt.fasta`` file should look like this with ``more``:

``` bash
$ more engineered_nt.fasta
>engineered lipase protein from Proteus mirabilis
ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAGCGAGATCGTG
GGCTTCCCCTACTTCTACGGCATCGCCGACGCCCTGACCCAGGACGGCCACCAGGTGTTC
ACCGCCAGCCTGAGCGCCTTCAACAGCAACGAGGTGAGGGGCAAGCAGCTGTGGCAGTTC
GTGCAGACCATCCTGCAGGAGACCCAGACCAAGAAGGTGAACTTCATCGGCCACAGCCAG
GGCCCCCTGGCCTGCAGGTACGTGGCCGCCAACTACCCCGACAGCGTGGCCAGCGTGACC
AGCATCAACGGCGTGAACCACGGCAGCGAGATCGCCGACCTGTACAGGAGGATCATCAGG
AAGGACAGCATCCCCGAGTACATCGTGGAGAAGGTGCTGAACGCCTTCGGCACCATCATC
AGCACCTTCAGCGGCCACAGGGGCGACCCCCAGGACGCCATCGCCGCCCTGGAGAGCCTG
ACCACCGAGCAGGTGACCGAGTTCAACAACAAGTACCCCCAGGCCCTGCCCAAGACCCCC
TGCGGCGAGGGCGACGAGATCGTGAACGGCGTGCACTACTACTGCTTCGGCAGCTACATC
CAGGAGCTGATCGCCGGCGAGAACGGCAACCTGCTGGACCCCACCCACGCCGCCATGAGG
GTGCTGAACACCCTGTTCACCGAGAAGCAGAACGACGGCCTGGTGGGCAGGTGCAGCATG
AGGCTGGGCAAGCTGATCAAGGACGACTACGCCCAGGACCACTTCGACATGGTGAACCAG
GTGGCCGGCCTGGTGAGCTACAACGAGAACATCGTGGCCATCTACACCCTGCACGCCAAG
TACCTGGCCAGCAAGCAGCTG
```

Here we have two short FASTA files, each just 16 lines long (one header line plus 15 lines of sequence), and each containing a single nucleotide sequence. By eye the two sequences look almost identical. We will come back to this later.
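To get a feel for how "almost identical" can be checked by a program rather than by eye, here is a small sketch comparing the first sequence line of each file, position by position. The function name ``find_differences`` is invented for this illustration, and the approach only works because the two sequences are the same length; it is not a substitute for a proper sequence alignment.

```python
def find_differences(seq_a, seq_b):
    """List (position, base_a, base_b) where two equal-length sequences differ.

    Positions are counted from one, as is usual in biology.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("This simple comparison needs equal-length sequences")
    return [
        (index + 1, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# First sequence line from each of the two files shown above
wildtype = "ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAACGAGATCGTG"
engineered = "ATGAGCACCAAGTACCCCATCGTGCTGGTGCACGGCCTGGCCGGCTTCAGCGAGATCGTG"

for position, wt_base, eng_base in find_differences(wildtype, engineered):
    print("Position %i: %s -> %s" % (position, wt_base, eng_base))
```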

FASTA files can contain much longer sequences - like whole chromosomes.

FASTA files often contain multiple sequences - like all the proteins from a bacterium, all the gene coding sequences, or any hand compiled set of nucleotide sequences of interest. Have a look at the third file, ``glycoside_hydrolases_nt.fasta``, for comparison:

```bash
$ more glycoside_hydrolases_nt.fasta
>ECA0662 6-phospho-beta-glucosidase
ATGAAAGCATTCCCCGACGGATTTTTATGGGGCGGTTCAGTCGCAGCAAATCAGGTTGAA
GGGGCATGGAATGAAGACGGCAAAGGCGTGTCGACCTCCGATCTTCAGCTAAAGGGCGTG
CATGGTCCGGTGACAGAACGCGATGAGAGCATTAGCTGCATCAAAGATCGGGCAATCGAT
...
```

You should find this contains eight nucleotide sequences. We'll soon look at the genome these came from, the bacterium *Pectobacterium carotovorum* (originally known as *Erwinia carotovora*), accession [NC_004547.2](https://www.ncbi.nlm.nih.gov/nuccore/NC_004547.2).
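Because every record starts with exactly one ``>`` header line, counting the sequences in a FASTA file comes down to counting those lines. The sketch below shows the idea on a tiny made-up example; the function name ``count_fasta_records`` and the two ``gene`` entries are purely illustrative.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting the '>' header lines."""
    return sum(1 for line in lines if line.startswith(">"))


# A tiny made-up example standing in for a real file
example_lines = [
    ">geneA first made-up entry",
    "ATGAAAGCATTC",
    ">geneB second made-up entry",
    "ATGCCCGACGGA",
    "TTTTTATGGGGC",
]
print(count_fasta_records(example_lines))
```

An open file handle can be passed in place of the list, since iterating over a file yields its lines one at a time. At the command line, ``grep -c "^>"`` followed by the filename gives the same count.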

Most bioinformatics tools for working on sequence data will accept FASTA format input.

<a id="genbank_format"></a>
## GenBank format

When a new genome sequence gets published, it should be deposited in the global sequence archive under the International Nucleotide Sequence Database Collaboration (INSDC), via one of the three mirrors run by the NCBI in America, EMBL-EBI in Europe, and DDBJ in Japan.

We're going to be using the NCBI website, whose default view shows the whole genome as a webpage with clickable links, based on the plain text NCBI GenBank format.

GenBank format is intended to be human readable (see this [sample record](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html)), and is a widely used file format for annotated genomes. The related EMBL file format, used in the European sequence database which mirrors the NCBI sequence database, shares the same INSDC feature table design.

A sequence record in GenBank format has three main sections:

- header starting with the ``LOCUS`` line 
- feature table listing any annotated features like genes
- the actual sequence (or how it is built up from other records), ending with a ``//`` line.
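As a rough illustration of these three sections, the sketch below splits a record on the keyword lines that introduce them: ``FEATURES`` starts the feature table, and ``ORIGIN`` (or ``CONTIG``, when the sequence is built up from other records) starts the sequence section. The function name ``split_genbank_record`` and the toy record are made up for this example, and real records contain more detail than this handles.

```python
def split_genbank_record(text):
    """Split one GenBank record into header, feature table and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):
            break  # the // line marks the end of the record
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("ORIGIN", "CONTIG")):
            current = "sequence"
        sections[current].append(line)
    return sections


# A heavily abbreviated, made-up record for illustration only
toy_record = """LOCUS       TOY0001                   12 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     gene            1..12
                     /locus_tag="TOY_0001"
ORIGIN
        1 atgagcacca ag
//
"""

sections = split_genbank_record(toy_record)
for name in ("header", "features", "sequence"):
    print(name, len(sections[name]), "lines")
```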

With bacterial genomes, for each annotated gene you expect to see a pair of features - a ``gene`` immediately followed by a ``CDS`` entry for the coding sequence, using the same location coordinates. With eukaryotes there is usually an ``mRNA`` entry in addition, and the CDS location is more complicated in order to describe only the exons which make up the coding sequence.

Here we've picked a *Pectobacterium carotovorum* (previously known as *Erwinia carotovora*) genome as an example, [NC_004547.2](https://www.ncbi.nlm.nih.gov/nuccore/NC_004547.2). You can scroll down to see the genes, or search within the page once it has fully loaded.

Supposing you were interested in the gene with original locus tag ``ECA0662`` (which is the first entry in example file ``glycoside_hydrolases_nt.fasta``), you should find it here:


```
     gene            complement(736847..738235)
                     /locus_tag="ECA_RS03295"
                     /old_locus_tag="ECA0662"
                     /db_xref="GeneID:2881615"
     CDS             complement(736847..738235)
                     /locus_tag="ECA_RS03295"
                     /old_locus_tag="ECA0662"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:WP_011092278.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="6-phospho-beta-glucosidase"
                     /protein_id="WP_039289952.1"
                     /db_xref="GeneID:2881615"
```
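The location string ``complement(736847..738235)`` says the feature runs from base 736847 to base 738235 (counting from one, both ends included) on the reverse strand. A minimal sketch of reading such a string in Python follows; the function name ``parse_simple_location`` is invented here, and it deliberately handles only this simple form, not the joined or fuzzy locations that the feature table format also allows.

```python
import re


def parse_simple_location(location):
    """Parse 'start..end' or 'complement(start..end)' into (start, end, strand)."""
    strand = +1
    if location.startswith("complement(") and location.endswith(")"):
        strand = -1
        location = location[len("complement("):-1]
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        raise ValueError("Unsupported location: %r" % location)
    return int(match.group(1)), int(match.group(2)), strand


start, end, strand = parse_simple_location("complement(736847..738235)")
length = end - start + 1  # both end points are included
print("Strand %+i, %i bases, %i codons" % (strand, length, length // 3))
```

For this entry the length works out as 1389 bases, which is a whole number of codons (463), as you would hope for a coding sequence.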

We will talk about the meaning of this entry, and the gene context. But since we're going to work more with this GenBank file directly, we'll download it first.

<a id="downloading_gb"></a>
## Downloading GenBank files

You *could* download the GenBank file we want via the NCBI website, but it takes a lot of manual steps. Currently you would start with the "Send to" menu at the top right of the screen, picking "Complete Record", destination "File", and format "GenBank (full)". Save the file as ``NC_004547.gbk``.

![NCBI menu to send a record to file in GenBank format](./images/ncbi_send_to_file_genbank_full.png)

If you accidentally use format "GenBank", then with this example you'll get a much smaller file missing all the features and sequence data. Also, depending on your browser's settings, the file may be saved with a default name like ``sequence.gb.txt`` in your downloads folder, which means you'd have to move the file to the working folder and rename it.

Thankfully the NCBI provide a way to automate this, which they call the [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/). These can be used from many different programming languages, but we will use them from Python, with the Biopython wrapper ``Bio.Entrez`` to help.

```python
########################################################################
# For the purposes of this workshop, don't worry about how this works, #
# a key point here is if you run this you'll all get the same file.    #
########################################################################

# Biopython's module to access the NCBI Entrez Programming Utilities
from Bio import Entrez

# The NCBI likes to know who is using their services in case of problems
Entrez.email = "your.name.here@example.org"

accession = "NC_004547"

print("Fetching %s from NCBI..." % accession)

# Return type "gbwithparts" matches "GenBank (full)" on the website
fetch_handle = Entrez.efetch("nuccore", id=accession, rettype="gbwithparts", retmode="text")

# Open an output file, and write all the data from the NCBI to it
with open(accession + ".gbk", "w") as output_handle:
    output_handle.write(fetch_handle.read())

print("Saved %s.gbk" % accession)
```
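For the curious, the Entrez Programming Utilities are ordinary web addresses with the request details given as parameters, and the fetch above corresponds roughly to a URL like the one built below. This is only a sketch using the Python standard library: the function name ``build_efetch_url`` is made up, Biopython adds further details of its own, and nothing is actually downloaded here.

```python
from urllib.parse import urlencode

# Base address of the NCBI "efetch" utility
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, email):
    """Build (but do not open) an efetch URL for a full GenBank record."""
    params = {
        "db": "nuccore",           # the nucleotide database
        "id": accession,
        "rettype": "gbwithparts",  # matches "GenBank (full)" on the website
        "retmode": "text",
        "email": email,            # lets the NCBI contact you about problems
    }
    return EFETCH_URL + "?" + urlencode(params)


print(build_efetch_url("NC_004547", "your.name.here@example.org"))
```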

If you look at ``NC_004547.gbk`` in a text editor, or simply view it at the command line using ``more``, it should look like the NCBI website's content - although there are some interesting differences...

### Resources

* [NCBI Entrez Programming Utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/)