# Part 2 - Accessing & Working with DNA, RNA & Protein Sequences

In this notebook we will start working with biological sequences by retrieving records and examining their structure and the information associated with them. We will also start manipulating the sequences and performing some basic analysis to become more familiar with the sorts of operations and processes we can perform.

We have included web links where appropriate to additional information and web-based resources that can be used to either replace or complement working in the Python environment. It is absolutely fine to use web-based tools to perform bioinformatic work, but those tools are often limited in their functionality in ways that eventually become problematic in real-life analysis situations. This is why, if you would like to pursue further study and/or research in Bioinformatics and related disciplines, it is a good plan to begin learning the two core programming languages that are in common use, namely [Python](https://www.learnpython.org) and the statistical programming language [R](https://cran.r-project.org).

In [None]:
# install and/or load BioPython
%pip install biopython

# replace this with your e-mail address
EMAIL = 'A.N.Other@example.com'

First we load the Entrez module from BioPython.

You can read the description of this module [here](https://biopython.org/DIST/docs/api/Bio.Entrez-module.html).

In [None]:
from Bio import Entrez

Entrez.email = EMAIL

# note the egquery function provides Entrez database counts from a global search.
handle = Entrez.egquery(term="Cypripedioideae")
record = Entrez.read(handle)
handle.close()

print(type(record))

# Look at what is inside the record object
print(record.keys())

# The first contains the search term
print(record['Term'])

# The second contains a list of results from different Entrez Databases
for row in record['eGQueryResult']:
    print(row)

# we can iterate through the record and print only the nucleotide ('nuccore') result
for row in record["eGQueryResult"]:
    if row["DbName"]=="nuccore":
        print('***',row)
        # print just how many nucleotide entries there are
        print(row["Count"])
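
The loop above picks out a single database by name. As a minimal offline sketch of the same idea, the snippet below builds a lookup from database name to hit count. The `mock_record` dictionary and its counts are invented for illustration and only assume the structure printed by the cell above; `counts_by_database` is a hypothetical helper, not part of BioPython.

```python
# Mock of the structure returned by Entrez.read() for an egquery call.
# The counts below are invented for illustration, not real NCBI results.
mock_record = {
    "Term": "Cypripedioideae",
    "eGQueryResult": [
        {"DbName": "pubmed", "Count": "10", "Status": "Ok"},
        {"DbName": "nuccore", "Count": "25", "Status": "Ok"},
        {"DbName": "protein", "Count": "7", "Status": "Ok"},
    ],
}


def counts_by_database(record):
    """Map each database name to its hit count as an integer."""
    counts = {}
    for row in record["eGQueryResult"]:
        # Counts arrive as strings; rows with a non-numeric count are skipped.
        if row["Count"].isdigit():
            counts[row["DbName"]] = int(row["Count"])
    return counts


counts = counts_by_database(mock_record)
print(counts["nuccore"])
```

Converting the counts to integers makes it straightforward to sort or compare databases by the number of hits.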

Note the number of nucleotide sequences returned and compare it to the result you get if you search for "Cypripedioideae" using the [Entrez Search Webpage](https://www.ncbi.nlm.nih.gov/search/). For interest, the Cypripedioideae are a subfamily of orchids (one member is the [Lady's Slipper Orchid](https://en.wikipedia.org/wiki/Cypripedium_calceolus)).

Let's now select a particular sequence and download it for further analysis.

In [None]:
from Bio import Entrez

Entrez.email = EMAIL

# we're going to search for up to 1000 sequences and we're going to ask for the accession number for each

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Cypripedioideae",retmax=1000,idtype='acc')
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
# let's fetch one (index 500 assumes the search above returned more than 500 ids)
accession = record['IdList'][500]

handle = Entrez.efetch(db="nucleotide", id=accession, retmode="xml")
entry = Entrez.read(handle)
handle.close()

#print the whole entry (this is a GenBank record in XML format)
print(entry)

In [None]:
print(entry[0]['GBSeq_definition'])
print(entry[0]['GBSeq_organism'])

We can retrieve the record in a more user-friendly format.

In [None]:
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
print(handle.read())
handle.close()

We can use the Bio.SeqIO module, which handles groups of records, to parse the fetched record and obtain a Bio.Seq.Seq sequence object.

In [None]:
from Bio import SeqIO
handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

for entry in records:
    sequence = entry.seq
    print(sequence)
    print(type(sequence))
handle.close()

# sequence still refers to the last (here, only) record parsed above
print('complement',sequence.complement())
print('reverse_complement',sequence.reverse_complement())
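
To make clear what `complement()` and `reverse_complement()` compute, here is a plain-Python sketch of the same operations on an ordinary string. It is an illustration only, not how BioPython implements them, and it assumes unambiguous DNA bases (A, C, G, T); the function names are chosen for this example.

```python
# Lookup table mapping each unambiguous DNA base to its pairing partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its partner, keeping the original order."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and then read it in the opposite direction."""
    return complement(dna)[::-1]


example = "ATGCCG"
print(complement(example))          # TACGGC
print(reverse_complement(example))  # CGGCAT
```

The reverse complement is the sequence of the opposite strand read 5' to 3', which is why it is the form usually wanted when a gene lies on the other strand.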

The real power of this system comes when you want to search and work with a lot of sequences.

Let's say we want to search for gene entries for Pax6.

In [None]:
# search for Pax6 sequences in the nucleotide database

from Bio import Entrez

Entrez.email = EMAIL

# we're going to limit this to 100 sequences; without idtype='acc' the ids returned are GI numbers rather than accessions

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='nucleotide',term="Pax6[Gene]",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

In [None]:
# now let's fetch them all; to do this we extract the id list (GI numbers)

gi_list = record['IdList']

#then turn it into a comma-separated string

gi_str = ",".join(gi_list)

handle = Entrez.efetch(db="nucleotide", id=gi_str, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")

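for record in records:
    # the organism name is stored in the GenBank annotations; record.description holds the definition line
    print("%s, length %i, from organism %s" % (record.name, len(record), record.annotations.get("organism", "unknown")))
handle.close()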
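
Joining every id into one comma-separated string works well for 100 ids. For much longer lists, one common approach is to split the ids into smaller batches and fetch each batch in turn. The sketch below shows only the batching step, with no network access; `id_batches`, the batch size of 20 and the example ids are all hypothetical choices made for illustration.

```python
def id_batches(ids, batch_size=20):
    """Yield comma-separated strings holding at most batch_size ids each."""
    for start in range(0, len(ids), batch_size):
        yield ",".join(ids[start:start + batch_size])


# 45 made-up ids standing in for a real IdList
example_ids = [str(n) for n in range(1, 46)]

batches = list(id_batches(example_ids, batch_size=20))
print(len(batches))   # 3
print(batches[-1])    # 41,42,43,44,45
```

Each string produced this way could be passed as the `id` argument of a separate `Entrez.efetch` call.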

Now we're going to pull a full gene entry for human Pax6 from GenBank and look at it. We can also do this online by clicking [here](https://www.ncbi.nlm.nih.gov/nuccore/208879460).

In [None]:
from Bio import Entrez

Entrez.email = EMAIL
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
gb_entry = handle.read()
handle.close()

#NB this is just a straight string at this point (as we just read() it straight into a string object)
print(gb_entry)

Now we're going to extract the coding sequence from this entry and translate it into protein.

In [None]:
from Bio import SeqIO
from Bio import Entrez

Entrez.email = EMAIL
handle = Entrez.efetch(db="nucleotide", id="208879460", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

if record.features:
    for feature in record.features:
        #this tag identifies the CoDingSequences from the record
        if feature.type == "CDS":
            print(feature.qualifiers["protein_id"])
            print(feature.location,'\n')
            current_sequence = feature.location.extract(record).seq
            print('Nucleotide Sequence')
            print(current_sequence,'\n')
            #translate the current sequence into protein
            print('Protein Sequence')
            print(current_sequence.translate(),'\n')
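
To show what translation does under the hood, the following plain-Python sketch reads a coding sequence three bases at a time and looks each codon up in the standard genetic code. It illustrates the idea rather than BioPython's implementation; the `translate` function here is a stand-alone example, and unknown codons are reported as `X` by assumption.

```python
from itertools import product

# Standard genetic code written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a DNA coding sequence codon by codon ('*' marks a stop)."""
    cds = cds.upper()
    usable_length = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    protein = []
    for i in range(0, usable_length, 3):
        # Codons containing ambiguous bases are reported as 'X'.
        protein.append(CODON_TABLE.get(cds[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))  # MA*
```

The trailing `*` corresponds to the stop codon, which is also why the protein printed from a full CDS ends with an asterisk.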


In [None]:
from Bio import Entrez

Entrez.email = EMAIL

# note the Entrez esearch function searches and returns a handle to the results.
handle = Entrez.esearch(db='gene',term="Nrg1[Gene] AND human",retmax=100)
record = Entrez.read(handle)
handle.close()

#look at the first 10 ids
print(record['IdList'][:10])

# let's retrieve the first hit in XML format and use the Entrez parser to read it
handle = Entrez.efetch(db="gene", id=record['IdList'][:1], retmode="xml")
# this returns a list of records, each of which behaves like a Python dict
records = Entrez.read(handle)
handle.close()

# look at the first record by iterating through the keys of the dict
# NB there's a lot of information in here
for feature in list(records[0]):
    print(feature,':',records[0][feature])

### Challenge 1 - Finding Genes with NCBI-Entrez
Using the Entrez website and/or what you've learned about BioPython's ability to query NCBI services, retrieve entries for a gene called Nrg1.
- How many different gene entries are there for this gene in NCBI databases?
- What is the full name of this gene?
- What kind of protein does this gene encode?

### Challenge 2 - Human and Mouse Nrg1 Genes
Using the Entrez website and/or what you've learned about BioPython's ability to query NCBI services, retrieve full-length human and mouse (RefSeq) gene entries for Nrg1.
- What are the accession numbers / ids of the GenBank records?
- How long are the human NRG1 and mouse Nrg1 proteins?
- How many nucleotide sequence differences are there between their longest CDSs?
- How many protein sequence differences are there between their longest proteins?
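
The last two questions come down to counting the positions at which two sequences differ. A minimal sketch of that count is given below. It assumes the two sequences are already the same length (in practice they would first need to be aligned); `count_differences` is a hypothetical helper and the example strings are made up.

```python
def count_differences(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length; align them first")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Two made-up sequences that differ at two positions
print(count_differences("ATGGCCAAG", "ATGGCTAAA"))  # 2
```

The same function works for protein sequences, since it only compares characters position by position.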