<img src="images/JHI_STRAP_Web.png" style="width: 150px; float: right;">
# 03 - Parsing

## Table of Contents

1. [Introduction](#introduction)

<a id="introduction"></a>
## Introduction

<div class="alert-success">
<b>So far we've inspected GenBank files by eye and in Artemis. Now we'll examine them with a Python script.</b>
</div>

To recap, we've used the NCBI Entrez Programming Utilities via Biopython's ``Bio.Entrez`` to download two genomes in GenBank format:

- ``NZ_GG668576.gbk`` including our wild-type gene of interest with NCBI protein ID ``WP_004242549.1``
- ``NC_010554.gbk`` including our engineered gene of interest with NCBI protein ID ``WP_004247922.1``

What we're going to do now is pull out the nucleotide coding sequence for each of these genes by locating the CDS feature annotated with the desired protein ID.

We'll use Biopython to parse each genome file into a ``SeqRecord``, whose ``features`` attribute holds all the annotated features as a list. We'll then loop over that list to find the desired CDS feature:

```python
# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO

def get_cds_feature_with_protein_id(accession, protein_id):
    """Load <accession>.gbk and return the CDS feature with this protein ID (or None)."""
    # Ask Biopython to parse the GenBank format file.
    # SeqIO.read expects the file to contain exactly one record.
    genome_record = SeqIO.read(accession + ".gbk", "genbank")
    # print("Loaded %s" % genome_record.description)

    # Loop over the features; qualifier values are lists of strings,
    # so default to an empty list when there is no protein_id qualifier
    for feature in genome_record.features:
        if feature.type == "CDS" and protein_id in feature.qualifiers.get("protein_id", []):
            # print("Found feature with protein id %s" % protein_id)
            return feature
    # Could not find it
    return None

wildtype_cds = get_cds_feature_with_protein_id("NZ_GG668576", "WP_004242549.1")
print(wildtype_cds)

engineered_cds = get_cds_feature_with_protein_id("NC_010554", "WP_004247922.1")
print(engineered_cds)
```

```
type: CDS
location: [46145:47009](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['COORDINATES: similar to AA sequence:RefSeq:WP_004242549.1']
    Key: locus_tag, Value: ['HMPREF0693_RS00250']
    Key: note, Value: ['Derived by automated computational analysis using gene prediction method: Protein Homology.']
    Key: old_locus_tag, Value: ['HMPREF0693_0570']
    Key: product, Value: ['lipase']
    Key: protein_id, Value: ['WP_004242549.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MSTKYPIVLVHGLAGFNEIVGFPYFYGIADALRQDGHQVFTASLSAFNSNEVRGKQLWQFVQTLLQETQAKKVNFIGHSQGPLACRYVAANYPDSVASVTSINGVNHGSEIADLYRRIMRKDSIPEYIVEKVLNAFGTIISTFSGHRGDPQDAIAALESLTTEQVTEFNNKYPQALPKIPGGEGDEIVNGVHYYCFGSYIQGLIAGEKGNLLDPTHAAMRVLNTFFTEKQNDGLVGRSSMRLGKLIKDDYAQDHIDMVNQVAGLVGYNEDIVAIYTQHAKYLASKQL']

type: CDS
location: [1063080:1063944](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GeneID:6803666']
    Key: inference, Value: ['EXIST
```
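
The ``location`` lines in the output above, such as ``[46145:47009](+)``, are printed by Biopython using Python-style coordinates: counting starts at zero and the end position is excluded. As a small sketch of what that implies, the length of each CDS and its codon count can be worked out with plain arithmetic. The helper name ``span_length`` is made up for illustration, and the coordinates are simply copied from the printed features.

```python
def span_length(start, end):
    """Length of a zero-based, end-exclusive interval (hypothetical helper)."""
    return end - start  # no "+ 1" needed because the end is excluded

# Coordinates copied from the printed features above
wildtype_start, wildtype_end = 46145, 47009
engineered_start, engineered_end = 1063080, 1063944

wildtype_length = span_length(wildtype_start, wildtype_end)
engineered_length = span_length(engineered_start, engineered_end)
print(wildtype_length, engineered_length)  # 864 864

# A complete CDS should be a whole number of codons
codons, leftover = divmod(wildtype_length, 3)
print(codons, leftover)  # 288 0

# The GenBank flat file itself uses one-based, inclusive coordinates,
# so the same wild-type span would be written there as 46146..47009
genbank_style = "%i..%i" % (wildtype_start + 1, wildtype_end)
print(genbank_style)  # 46146..47009
```

Both genes come out at 864 bp, or 288 codons. If the location includes the stop codon, as is usual for GenBank CDS features, that corresponds to the residues in the ``translation`` qualifier plus one stop codon.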

Note that because of how we downloaded ``NZ_GG668576.gbk`` and ``NC_010554.gbk`` using NCBI Entrez, the CDS features include the protein translation. You may recall that when we looked at these genomes on the NCBI website, the translation was not shown.
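
To go from a CDS feature to the nucleotide coding sequence itself, one option is the feature's ``extract`` method, which slices the parent sequence using the feature's location (and reverse-complements it for minus-strand features). The sketch below uses a made-up toy record so that it runs without any files; ``toy_genome``, ``toy_cds``, ``TOY_0001.1`` and ``get_cds_nucleotides`` are illustrative names, not part of the downloaded data. It also checks that translating the extracted nucleotides reproduces the ``translation`` qualifier.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Toy "genome" (invented for illustration): CCC padding, then
# ATG GCC AAA TAA (Met, Ala, Lys, stop), then GGG padding
toy_genome = SeqRecord(Seq("CCCATGGCCAAATAAGGG"), id="toy")

# Toy CDS feature covering the twelve coding bases on the forward strand
toy_cds = SeqFeature(
    FeatureLocation(3, 15, strand=+1),
    type="CDS",
    qualifiers={
        "protein_id": ["TOY_0001.1"],
        "transl_table": ["11"],
        "translation": ["MAK"],
    },
)
toy_genome.features.append(toy_cds)

def get_cds_nucleotides(genome_record, cds_feature):
    """Return the nucleotide sequence covered by a feature (hypothetical helper)."""
    return cds_feature.extract(genome_record.seq)

cds_seq = get_cds_nucleotides(toy_genome, toy_cds)
print(cds_seq)  # ATGGCCAAATAA

# Translate with the genetic code table named in the feature. With cds=True,
# Biopython checks for a valid start codon, a length that is a multiple of
# three and a single terminal stop codon, and leaves the stop out of the result.
table = int(toy_cds.qualifiers["transl_table"][0])
protein = cds_seq.translate(table=table, cds=True)
print(protein)  # MAK

matches_annotation = str(protein) == toy_cds.qualifiers["translation"][0]
print(matches_annotation)  # True
```

With the real data the equivalent call would be ``wildtype_cds.extract(genome_record.seq)``. That needs the parsed ``genome_record`` as well as the feature, so ``get_cds_feature_with_protein_id`` would have to return the record too, or do the extraction itself.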

### Resources

* [Biopython Tutorial and Cookbook](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
* [Biopython Tutorial and Cookbook (PDF)](http://biopython.org/DIST/docs/tutorial/Tutorial.pdf)

 