# Accessing & Working with DNA, RNA & Protein Sequences

In this notebook we will start working with biological sequences by retrieving records and looking at their structure and the information associated with them. We will also start manipulating the sequences and performing some basic analysis to become more familiar with the sorts of operations and processes we can perform.

We have included web links where appropriate to additional information and web-based resources that can be used to either replace or complement working in the Python environment. It is absolutely fine to use web-based tools to perform bioinformatic work, but those tools are often limited in their functionality in ways that eventually become problematic in real-life analysis situations. This is why, if you would like to pursue further study and/or research in Bioinformatics and related disciplines, it is a good plan to begin learning the two core programming languages that are in common use, namely [Python](https://www.learnpython.org) and the statistical programming language [R](https://cran.r-project.org).

In [None]:
# install and/or load BioPython
!pip install biopython

First we load the Entrez module from BioPython.

You can read the description of this module [here](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html).

In [None]:
from Bio import Entrez

Entrez.email = "A.N.Other@example.com" # You should replace this with your e-mail address 

# note the egquery function provides Entrez database counts from a global search.
handle = Entrez.egquery(term="Cypripedioideae")
record = Entrez.read(handle)
handle.close()

print(type(record))

# Look at what is inside the record object
print(record.keys())

# The first contains the search term
print(record['Term'])

# The second contains a list of results from different Entrez Databases
for row in record['eGQueryResult']:
    print(row)

# we can iterate through the record and only return the 'nucleotide' result
for row in record["eGQueryResult"]:
    if row["DbName"]=="nuccore":
        print('***',row)
        # print just how many nucleotide entries there are
        print(row["Count"])

Note the number of nucleotide sequences returned and compare it to the result you get if you search for "Cypripedioideae" using the [Entrez Search Webpage](https://www.ncbi.nlm.nih.gov/search/). For interest, the Cypripedioideae are a subfamily of orchids (one member is the [Lady's Slipper Orchid](https://en.wikipedia.org/wiki/Cypripedium_calceolus)).
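As a small offline illustration of working with these counts, the per-database results can be collected into a dictionary. This is only a sketch: the rows below are made-up placeholder values that imitate the shape of `record['eGQueryResult']` (they are not real NCBI counts), the non-numeric `"Error"` entry is an assumption about what a failed database lookup might look like, and `counts_by_database` is a hypothetical helper name.

```python
# Placeholder rows imitating the shape of record['eGQueryResult'].
# The counts are invented for illustration; they are NOT real NCBI results.
example_rows = [
    {"DbName": "pubmed", "Count": "12"},
    {"DbName": "nuccore", "Count": "345"},
    {"DbName": "protein", "Count": "Error"},  # assumed non-numeric entry
]


def counts_by_database(rows):
    """Map database name -> integer count, skipping non-numeric counts."""
    counts = {}
    for row in rows:
        # Counts arrive as strings, so convert them before comparing numbers
        if row["Count"].isdigit():
            counts[row["DbName"]] = int(row["Count"])
    return counts


counts = counts_by_database(example_rows)
print(counts["nuccore"])
```

Converting the counts to integers makes it easier to compare them numerically with the figure shown on the web page.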

Let's now select a particular sequence and download it for further analysis.

In [None]:
from Bio import Entrez

Entrez.email = "A.N.Other@example.com" # You should replace this with your e-mail address 

# we're going to search for up to 1000 sequences and we're going to ask for the accession number for each

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Cypripedioideae",retmax=1000,idtype='acc')
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
#let's fetch one (index 500 assumes the search above returned more than 500 ids)
accession = record['IdList'][500]

handle = Entrez.efetch(db="nucleotide", id=accession, retmode="xml")
entry = Entrez.read(handle)
handle.close()

#print the whole entry (this is a GenBank record in XML format)
print(entry)

In [None]:
print(entry[0]['GBSeq_definition'])
print(entry[0]['GBSeq_organism'])

We can retrieve the record in a more user-friendly format:

In [None]:
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
print(handle.read())
handle.close()

We can use the Bio.SeqIO module, which parses one or more records from a handle, to read the fetched entry and obtain its sequence as a Bio.Seq.Seq object.
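A Seq object offers methods such as `complement()` and `reverse_complement()`. As a rough idea of what those compute, here is a pure-Python sketch on a plain string. It is not how Biopython implements them, it only handles the four unambiguous DNA bases, and the function names and the toy sequence are made up for illustration.

```python
# Translation table for the four unambiguous DNA bases (upper and lower case).
# Ambiguity codes such as N or R are deliberately not handled in this sketch.
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Return the base-by-base complement of a DNA string."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]


toy_dna = "ATGC"  # toy sequence, not taken from any database record
print("complement", complement(toy_dna))
print("reverse_complement", reverse_complement(toy_dna))
```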

In [None]:
from Bio import SeqIO
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

for entry in records:
    sequence = entry.seq
    print(sequence)
    print(type(sequence))
handle.close()

# sequence still refers to the Seq object of the last (here, the only) record
print('complement',sequence.complement())
print('reverse_complement',sequence.reverse_complement())

The real power of this system comes when you want to search and work with a lot of sequences.
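When many ids are involved, one common pattern is to request them in batches rather than all at once. The sketch below only builds the comma-separated id strings and makes no network calls; `batch_ids`, the batch size of 3 and the placeholder ids are all assumptions chosen for illustration, not NCBI recommendations.

```python
def batch_ids(ids, batch_size):
    """Yield comma-separated strings holding at most batch_size ids each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# Placeholder ids; in practice these would come from a search result's IdList
example_ids = ["id%d" % n for n in range(1, 8)]

batches = list(batch_ids(example_ids, 3))
for batch in batches:
    print(batch)
```

Each string produced this way could then be passed as the `id` argument of a single fetch request.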

Let's say we want to search for nucleotide entries annotated with the gene name Pax6.

In [None]:
#search for Pax6 nucleotide entries

from Bio import Entrez

Entrez.email = "A.N.Other@example.com" # You should replace this with your e-mail address 

# we're going to limit this to 100 sequences; because idtype='acc' is not set here, the ids returned are GI numbers rather than accession numbers

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Pax6[Gene]",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
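# now let's fetch the first 10; to do this we extract the ids from the search result

from Bio import SeqIO

gi_list = record['IdList'][:10]

#then turn it into a comma-separated string

gi_str = ",".join(gi_list)

handle = Entrez.efetch(db="nucleotide", id=gi_str, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

# use a separate loop variable so the search result stored in record is not overwritten
for seq_record in records:
    # the organism is stored in the annotations of a parsed GenBank record
    organism = seq_record.annotations.get("organism", "unknown")
    print("%s, length %i, from organism %s" % (seq_record.name, len(seq_record), organism))
handle.close()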

Now we're going to pull a full gene entry for human Pax6 from GenBank and look at it. We can also do this online by clicking [here](https://www.ncbi.nlm.nih.gov/nuccore/208879460).

In [None]:
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
gb_entry = handle.read()
handle.close()

#NB this is just a straight string at this point (as we just read() it straight into a string object)
print(gb_entry)

Now we're going to extract the coding sequence from this entry and translate it into protein.
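To make the translation step concrete, here is a minimal pure-Python sketch of reading a coding sequence three bases at a time with the standard genetic code. It is an illustration rather than Biopython's implementation; the `translate` function, the use of `X` for unrecognised codons and the toy sequence are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence codon by codon ('*' marks a stop)."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable_length, 3):
        # Codons containing anything other than T, C, A, G become 'X'
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)


toy_cds = "ATGGCCTGGTAA"  # toy sequence, not from a real record
print(translate(toy_cds))
```

Note that the stop codon appears as a trailing `*` in the output, so the string is one character longer than the number of amino acids.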

In [None]:
from Bio import SeqIO
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

if record.features:
    for feature in record.features:
        #this tag identifies the CoDingSequences from the record
        if feature.type == "CDS":
            print(feature.qualifiers["protein_id"])
            print(feature.location,'\n')
            current_sequence = feature.location.extract(record).seq
            print('Nucleotide Sequence')
            print(current_sequence,'\n')
            #translate the current sequence into protein
            print('Protein Sequence')
            print(current_sequence.translate(),'\n')


In [None]:
from Bio import Entrez

Entrez.email = "A.N.Other@example.com" # You should replace this with your e-mail address

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

# fetch the first gene entry as XML; read the data before closing the handle
handle = Entrez.efetch(db="gene", id=record['IdList'][:1],retmode='xml')
gb_entry = handle.read()
handle.close()

#NB this is the raw, unparsed XML at this point (bytes or str, depending on the Biopython version)
print(type(gb_entry))

print(gb_entry)

### Challenge 1 - Finding Genes with NCBI-Entrez
Using the Entrez website and/or what you've learned about BioPython's abilities to query NCBI services, retrieve entries for a gene called Nrg1.
- How many different gene entries are there for this gene in NCBI databases?
- What is the full name of this gene?
- What kind of protein does this gene encode?

In [None]:
from Bio import Entrez

Entrez.email = "anybody.else@internet.com" # You should replace this with your e-mail address


#find out how many gene entries there are for Nrg1
handle = Entrez.egquery(term="Nrg1")
record = Entrez.read(handle)
handle.close()

# we can iterate through the record and only return the 'gene' result
for row in record["eGQueryResult"]:
    if row["DbName"]=="gene":
        # print how many gene entries there are
        print("There are "+row["Count"]+" gene entries for the gene Nrg1")

# you might notice this is different to the web page result. If you click the gene link you will notice, in
# the side box that details the query, that another term has been added to make sure the returned gene entries
# are live in the database. This excludes retired and/or redirected entries and gives the true value
handle = Entrez.egquery(term="Nrg1 AND alive[prop]")
record = Entrez.read(handle)
handle.close()

# we can iterate through the record and only return the 'gene' result
for row in record["eGQueryResult"]:
    if row["DbName"]=="gene":
        # print how many alive gene entries there are
        print("There are "+row["Count"]+" live gene entries for the gene Nrg1")

In [124]:
#search for the gene accessions
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human[Organism]")
human_gene_ids = Entrez.read(handle)['IdList']
handle.close()

#fetch the first gene entry - this is the summary you've found already
print("human gene id",human_gene_ids)
handle = Entrez.efetch(db='gene',id=human_gene_ids[0],retmode='text')
print(handle.read())
handle.close()

#OK so this is the XML data, which is the full record as you've seen
handle = Entrez.efetch(db='gene',id=human_gene_ids,rettype='gb',retmode='xml')

#I'm going to use ElementTree to parse the XML, this is PAINFUL!
#You need to examine the structure of the XML to work out which element tags to use, I'm not sure there's an easier way!
import xml.etree.ElementTree as ET
tree = ET.parse(handle)
handle.close()
all_sources = tree.findall('.//Other-source')

print("GO annotations for the Gene\n")
for source in all_sources:
    #keep only the sources whose database tag is GO
    for source_id in source.findall('.//Dbtag_db'):
        if source_id.text == 'GO':
            #the anchor element holds the human-readable GO term
            for GO_term in source.findall('Other-source_anchor'):
                print(GO_term.text)
            # print(ET.tostring(source))


human gene id ['3084']

1. NRG1
Official Symbol: NRG1 and Name: neuregulin 1 [Homo sapiens (human)]
Other Aliases: ARIA, GGF, GGF2, HGL, HRG, HRG1, HRGA, MST131, MSTP131, NDF-IT2, SMDF, NRG1
Other Designations: pro-neuregulin-1, membrane-bound isoform; acetylcholine receptor-inducing activity; glial growth factor 2; heregulin, alpha (45kD, ERBB2 p185-activator); neu differentiation factor; pro-NRG1; sensory and motor neuron derived factor
Chromosome: 8; Location: 8p12
Annotation: Chromosome 8 NC_000008.11 (31639245..32774046)
MIM: 142445
ID: 3084


GO annotations for the Gene

ErbB-2 class receptor binding
ErbB-3 class receptor binding
chemorepellent activity
cytokine activity
growth factor activity
integrin binding
protein tyrosine kinase activator activity
receptor tyrosine kinase binding
signaling receptor binding
transcription coregulator activity
transmembrane receptor protein tyrosine kinase activator activity
ERBB signaling pathway
ERBB signaling pathway
ERBB3 signaling pathway


### Challenge 2 - Human and Mouse Nrg1 Genes
Using the Entrez website and/or what you've learned about BioPython's abilities to query NCBI services, retrieve full-length human and mouse (RefSeq) gene entries for Nrg1.
- What are the accession numbers / ids of the GenBank records?
- How long are the human (NRG1) and mouse (Nrg1) proteins?
- How many nucleotide sequence differences are there between their longest CDSs?
- How many protein sequence differences are there between their longest proteins?
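Counting "differences" only makes sense once the two sequences have been aligned, because insertions and deletions shift positions. As a sketch of the counting step alone, the function below compares two already-aligned strings column by column. The gapped toy strings, the `-` gap character and the name `count_differences` are assumptions for illustration; producing the alignment itself is a separate problem.

```python
def count_differences(aligned_a, aligned_b, gap="-"):
    """Count mismatched columns and gapped columns between two aligned strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mismatches = 0
    gaps = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            gaps += 1          # an insertion or deletion in one sequence
        elif a != b:
            mismatches += 1    # a substitution
    return mismatches, gaps


# Toy aligned sequences, not real NRG1 data
toy_a = "ACGT-A"
toy_b = "ACCTTA"
mismatches, gaps = count_differences(toy_a, toy_b)
print("mismatches:", mismatches, "gaps:", gaps)
```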

In [None]:
from Bio import Entrez
from Bio import SeqIO

#From above we can find the gene ids for the human and mouse versions
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human[Organism]")
human_gene_ids = Entrez.read(handle)['IdList']
handle.close()
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND mouse[Organism]")
mouse_gene_ids = Entrez.read(handle)['IdList']
handle.close()

#Accession Numbers of Gene Entries
print("human gene id",human_gene_ids)
print("mouse gene ids",mouse_gene_ids)

#convenience function to find the genomic sequence entries from a gene_id
def find_gene_sequence(gene_id):
    handle = Entrez.efetch(db='gene',id=gene_id,rettype='gb',retmode='xml')
    gene_entry = Entrez.read(handle)
    handle.close()

    #Get the accession, start and stop locations from the gene XML record
    locus = gene_entry[0]['Entrezgene_locus'][0]
    interval = locus['Gene-commentary_seqs'][0]['Seq-loc_int']['Seq-interval']
    accession = locus['Gene-commentary_accession']
    #NB these XML coordinates are believed to be 0-based, whereas seq_start/seq_stop are 1-based,
    #so it is worth checking the fetched range against the gene's annotation
    start = interval['Seq-interval_from']
    stop = interval['Seq-interval_to']

    #Retrieve the annotated sequence and parse it for protein features
    handle = Entrez.efetch(db='nuccore',id=accession, seq_start=start, seq_stop=stop, rettype='gb', retmode='text')
    record = SeqIO.read(handle, "genbank")
    handle.close()
    return record

#Get the gene records
human_record = find_gene_sequence(human_gene_ids[0])
mouse_record = find_gene_sequence(mouse_gene_ids[0])

# for convenience I've defined a function that allows me to pass any suitable record and find the longest
# protein sequence
def find_longest_protein(record):
    longest_protein_length = 0
    longest_cds = None

    for feature in record.features:
        #this tag identifies the CoDingSequences from the record
        if feature.type == "CDS":
            current_sequence = feature.location.extract(record).seq
            #translate the current sequence into protein
            #NB translate() keeps a terminal stop codon as '*', so the length includes it
            translation = current_sequence.translate()
            if len(translation) > longest_protein_length:
                longest_protein_length = len(translation)
                longest_cds = current_sequence
            # print(feature.qualifiers["gene"],feature.qualifiers["protein_id"],len(translation))

    if longest_cds is None:
        raise ValueError("no CDS features found in record")

    print("Longest Protein -",longest_protein_length,"amino acids\nCDS -",longest_cds,"\nProtein -",longest_cds.translate())
    return longest_cds

# call the function to find the longest proteins for these genes
print("Human")
human_cd = find_longest_protein(human_record)
print("Mouse")
mouse_cd = find_longest_protein(mouse_record)

# The last two questions can be done online, but in order to do them programmatically you need to be able to perform
# pairwise sequence alignment using Python

In [None]:
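# NB Bio.pairwise2 is deprecated in recent Biopython releases in favour of Bio.Align.PairwiseAligner
from Bio import pairwise2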

#CDS
alignment = pairwise2.align.globalxx(human_cd,mouse_cd)

#globalxx scores 1 per match with no mismatch or gap penalties, so the score is
#the number of identical matches in the CDS alignment
aligned = alignment[0][2]

#work out the non-identical matches per sequence
print("Nucleotide")
print("Non-aligned human bases",int(len(human_cd)-aligned))
print("Non-aligned mouse bases",int(len(mouse_cd)-aligned))


#Proteins
alignment = pairwise2.align.globalxx(human_cd.translate(),mouse_cd.translate())

#the number of identical matches in the protein alignment
aligned = alignment[0][2]

#work out the non-identical matches per sequence
print("\nProtein")
print("Non-aligned human amino acids",int(len(human_cd.translate())-aligned))
print("Non-aligned mouse amino acids",int(len(mouse_cd.translate())-aligned))