<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 03 - Parsing

## Table of Contents

1. [Introduction](#introduction)

<a id="introduction"></a>
## Introduction

<div class="alert-success">
<b>Thus far we've looked at GenBank files by eye and in Artemis. Now we'll do this with a Python script.</b>
</div>

To recap, we've used the NCBI Entrez Programming Utilities via Biopython's ``Bio.Entrez`` to download two genomes in GenBank format:

- ``NZ_GG668576.gbk`` including our wild-type gene of interest with NCBI protein ID ``WP_004242549.1``
- ``NC_010554.gbk`` including our engineered gene of interest with NCBI protein ID ``WP_004247922.1``

What we're going to do now is pull out the nucleotide coding sequence for these genes using the CDS feature with the desired protein ID.

We'll use Biopython to parse each genome, which gives all the features as a list. We'll then loop over the list of features to find the desired CDS feature:

In [1]:
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO

def get_cds_feature_with_protein_id(seq_record, protein_id):
    """Return the CDS feature with the given protein ID, or None if not found."""
    # Loop over the features of the record passed in as an argument
    for feature in seq_record.features:
        # A missing protein_id qualifier is treated as an empty list
        if feature.type == "CDS" and protein_id in feature.qualifiers.get("protein_id", []):
            # print("Found feature with protein id %s" % protein_id)
            return feature
    # Could not find it
    return None

wildtype_genome = SeqIO.read("NZ_GG668576.gbk", "genbank")
wildtype_cds_feature = get_cds_feature_with_protein_id(wildtype_genome, "WP_004242549.1")
print(wildtype_cds_feature)

engineered_genome = SeqIO.read("NC_010554.gbk", "genbank")
engineered_cds_feature = get_cds_feature_with_protein_id(engineered_genome, "WP_004247922.1")
print(engineered_cds_feature)

type: CDS
location: [46145:47009](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['COORDINATES: similar to AA sequence:RefSeq:WP_004242549.1']
    Key: locus_tag, Value: ['HMPREF0693_RS00250']
    Key: note, Value: ['Derived by automated computational analysis using gene prediction method: Protein Homology.']
    Key: old_locus_tag, Value: ['HMPREF0693_0570']
    Key: product, Value: ['lipase']
    Key: protein_id, Value: ['WP_004242549.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKIPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL']

type: CDS
location: [1063080:1063944](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GeneID:6803666']
    Key: inference, Value: ['EXIST

Using your text editor, you can compare this to the matching sections in the GenBank files themselves. From ``NZ_GG668576.gbk`` we have:

```
     CDS             46146..47009
                     /locus_tag="HMPREF0693_RS00250"
                     /old_locus_tag="HMPREF0693_0570"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_004242549.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="lipase"
                     /protein_id="WP_004242549.1"
                     /translation="MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASL
                     SAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSI
                     NGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESL
                     TTEQVTEFNNKYPQALPKIPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAA
                     MRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYT
                     QHAKYLASKQL"

```

and from ``NC_010554.gbk``:

```
     CDS             1063081..1063944
                     /locus_tag="PMI_RS04850"
                     /old_locus_tag="PMI0999"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:WP_004242549.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="lipase"
                     /protein_id="WP_004247922.1"
                     /db_xref="GeneID:6803666"
                     /translation="MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASL
                     SAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSI
                     NGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESL
                     TTEQVTEFNNKYPQALPKTPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAA
                     MRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYT
                     QHAKYLASKQL"
```

Note that because of how we downloaded ``NZ_GG668576.gbk`` and ``NC_010554.gbk`` using NCBI Entrez, the CDS features include the protein translation. You may recall when we looked at these genomes on the NCBI website the translation was not shown.

### Locations in GenBank format

The first line of a GenBank (or EMBL) feature gives the coordinates. Here we're lucky that both are simple cases on the forward strand, ``46146..47009`` and ``1063081..1063944``, meaning ``start..end`` using inclusive one-based counting. You'll often see ``complement(start..end)`` for features on the reverse strand, and more complicated CDS locations using ``join(...)`` are common in eukaryotes to describe splicing.

Python (like many other programming languages) slices strings and arrays by counting from zero and excluding the end point, which is why in Biopython these locations seem to start one base earlier while the end value is unchanged.

In [2]:
print(wildtype_cds_feature.location)
print(engineered_cds_feature.location)

[46145:47009](+)
[1063080:1063944](+)
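
The relationship between the two counting conventions can be checked by hand. The sketch below is plain Python and does not use Biopython; the helper name `genbank_to_python` is hypothetical, introduced only for illustration.

```python
def genbank_to_python(start, end):
    """Convert one-based inclusive start..end to a zero-based, end-exclusive pair."""
    # Only the start shifts by one; the end stays the same because
    # Python slices already exclude the end point.
    return start - 1, end


# The two forward-strand locations quoted from the GenBank files
for start, end in [(46146, 47009), (1063081, 1063944)]:
    py_start, py_end = genbank_to_python(start, end)
    # The feature length is the same under either convention
    assert end - start + 1 == py_end - py_start
    print("%i..%i -> [%i:%i], length %i" % (start, end, py_start, py_end, py_end - py_start))
```

Both genes come out as 864 bases, which is 288 codons: 287 amino acids plus the stop codon.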


Rather than messing about with the start/end coordinates ourselves (and worrying about counting from zero or counting from one), we can get Biopython to apply the location information from the feature to extract the described region of the genome sequence.

In [4]:
wildtype_nucl = wildtype_cds_feature.extract(wildtype_genome.seq)
engineered_nucl = engineered_cds_feature.extract(engineered_genome.seq)
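
To give a feel for what applying a location involves, here is a minimal sketch in plain Python. It is not how Biopython implements ``extract``: the function names `reverse_complement` and `extract_simple` are hypothetical, the toy genome is made up, only unambiguous upper-case DNA is handled, and ``join(...)`` locations are ignored.

```python
# Complement of each unambiguous DNA base
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def extract_simple(genome, start, end, strand):
    """Extract a region given zero-based, end-exclusive coordinates.

    strand is +1 for the forward strand or -1 for the reverse strand.
    """
    region = genome[start:end]
    if strand == -1:
        # A complement(start..end) feature is read from the other strand
        region = reverse_complement(region)
    return region


toy_genome = "CCATGAAATAGGG"  # made-up sequence for illustration only
print(extract_simple(toy_genome, 2, 11, +1))
print(extract_simple(toy_genome, 2, 11, -1))
```

Letting Biopython do this for real features avoids re-implementing the strand and ``join(...)`` handling ourselves.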

### Translating sequences

Once we have the nucleotides, it takes just one line to get Biopython to translate them. By default this would use the "Standard" genetic code (as used for humans, for example), so we should explicitly specify that we want the bacterial table. The GenBank annotation tells us we should use NCBI translation table 11 - see the [NCBI's list of genetic codes](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi).

In [23]:
wildtype_prot = wildtype_nucl.translate(table=11, cds=True)
print("Translated into amino acids:")
print(wildtype_prot)

Translated into amino acids:
MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKIPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL


Here we also told Biopython to interpret this as a complete CDS, meaning it checks there is a whole number of codons (the sequence length is a multiple of three), verifies the last codon is a stop codon, and ensures that even if an alternative start codon is used it is translated as methionine (``M``). In both these examples, it turns out the start codon is the typical ``ATG``:

In [14]:
print("Start codon is %s" % wildtype_nucl[:3])  # Python's way to get first three letters
print("Stop codon is %s" % wildtype_nucl[-3:])  # Python trick for last three letters

Start codon is ATG
Stop codon is TAA
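
The checks described above can be sketched in plain Python. This is an illustration only, not Biopython's implementation: the function name `find_cds_problems` is hypothetical, the toy sequences are made up, and the start and stop codon sets are assumed to match NCBI translation table 11, so verify them against the NCBI list before relying on them.

```python
# Assumed to match NCBI translation table 11; check against the NCBI list
START_CODONS = {"ATG", "GTG", "TTG", "CTG", "ATT", "ATC", "ATA"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_cds_problems(nucl):
    """Return a list of problems found; an empty list means all checks passed."""
    nucl = str(nucl).upper()
    problems = []
    if len(nucl) % 3 != 0:
        problems.append("length is not a multiple of three")
    if nucl[:3] not in START_CODONS:
        problems.append("first codon is not a start codon")
    if nucl[-3:] not in STOP_CODONS:
        problems.append("last codon is not a stop codon")
    return problems


print(find_cds_problems("ATGAAATAA"))  # made-up toy sequence, passes all checks
print(find_cds_problems("ATGAAATA"))   # one base short, so two checks fail
```

In this notebook you could call `find_cds_problems(wildtype_nucl)`, which should report no problems given the ``ATG`` start and ``TAA`` stop shown above.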


<a id="ex01"></a>
<img src="./images/exercise.png" style="width: 40px; float: left;">

In the same way, let's get the second gene's nucleotide sequence and translate it:

In [16]:
engineered_nucl = engineered_cds_feature.extract(engineered_genome.seq)
engineered_prot = engineered_nucl.translate(table=11, cds=True)
print(engineered_prot)

MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKTPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL


### Comparing sequences

Let's double check that our translations match those given in the annotation:

In [24]:
print(wildtype_prot == wildtype_cds_feature.qualifiers["translation"][0])
print(engineered_prot == engineered_cds_feature.qualifiers["translation"][0])

True
True


Let's also double check that our translations match those in the original UniProt FASTA files we started with:

In [25]:
wildtype_uniprot = SeqIO.read("C2LFD0.fasta", "fasta")
engineered_uniprot = SeqIO.read("B4EVM3.fasta", "fasta")
print(wildtype_prot == wildtype_uniprot.seq)
print(engineered_prot == engineered_uniprot.seq)

True
True


In [22]:
print(wildtype_prot == engineered_prot)
print(len(wildtype_prot))
print(len(engineered_prot))

False
287
287


We know that while our two protein sequences look similar and have the same length, they differ. But how? This snippet of Python shows there is a single amino acid difference (``I`` versus ``T``) about two-thirds of the way along. This will be important later.

In [19]:
print(wildtype_prot)
# This is some advanced Python using a generator expression.
# It prints out "." where the two proteins match, and "X" where they differ.
print("".join(("." if a==b else "X") for a, b in zip(wildtype_prot, engineered_prot)))
print(engineered_prot)

MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKIPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL
..................................................................................................................................................................................X............................................................................................................
MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKTPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL
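
To get the position of the difference as a number rather than reading it off the line of dots, a small helper can list each mismatch. This is a sketch: the function name `find_differences` is hypothetical, and the two short strings are the nine residues surrounding the difference, copied from the sequences above.

```python
def find_differences(seq_a, seq_b):
    """Return (position, a, b) for each mismatch, counting positions from one."""
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Residues around the difference, copied from the two proteins above
wildtype_fragment = "QALPKIPGG"
engineered_fragment = "QALPKTPGG"
for position, a, b in find_differences(wildtype_fragment, engineered_fragment):
    print("Position %i: %s versus %s" % (position, a, b))
```

Calling `find_differences(wildtype_prot, engineered_prot)` on the full proteins works the same way. Note that ``zip`` stops at the end of the shorter input, so compare the lengths first, as we did above.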


### Resources

* [Biopython Tutorial and Cookbook](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
* [Biopython Tutorial and Cookbook (PDF)](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf)
* [NCBI's list of genetic codes](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)

 