# Biopython

[Biopython](https://biopython.org) is a collection of modules, tools, and scripts designed to help with common bioinformatics tasks. It is free and open source, and is maintained by a large community of contributors.

## Sequences and Sequence objects
Earlier, we created and manipulated DNA sequences as strings. A core component of Biopython is a set of 'classes' designed to store and manipulate particular kinds of biological data. One of these is the `Seq` class, which holds a biological sequence (DNA, RNA, protein, etc.). In Biopython releases before 1.78, a `Seq` had two primary components: the sequence itself and the 'Alphabet' that defines what the characters in the sequence represent (e.g. [A,G,C,T] for DNA or [A,G,C,U] for RNA). The `Bio.Alphabet` module was removed in Biopython 1.78, so in current releases a `Seq` stores only the sequence.

Let's start by creating a 'generic' sequence (`Seq` object) in Biopython.

In [None]:
from Bio.Seq import Seq # first we import the Seq class directly from a submodule of the Bio (Biopython) package.

mySeq = Seq("TTGAGCAGTTGAGCGATTATGTCGATTAGCTGAGGCATTCGAGCTATG")

print(mySeq)
print(type(mySeq))

#str(mySeq)

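print(mySeq.alphabet) # Biopython < 1.78 only: the alphabet attribute was removed in 1.78 and this line raises AttributeError there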

So why go to all of this trouble to create a `Seq` object instead of just storing the DNA sequence as a string? Well, the `Seq` object defines a number of methods (functions) that implement useful _biological_ operations:

In [None]:
print(mySeq)
print(mySeq.complement())
print(mySeq.reverse_complement())
print(mySeq.transcribe())
print(mySeq.transcribe().translate())
print(mySeq.count("TT"))
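
To make these operations less of a black box, here is a minimal plain-string sketch of what they compute. It is an illustration only, not Biopython's implementation: the function names are chosen to mirror the `Seq` methods, the sketch assumes unambiguous upper- or lower-case DNA letters, and `translate` uses the standard genetic code and ignores any trailing bases that do not fill a codon.

```python
# Plain-string sketch of the Seq operations (illustrative, not Biopython's code).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence and read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Treat the DNA as the coding strand and replace T with U."""
    return dna.replace("T", "U").replace("t", "u")


def translate(seq):
    """Translate DNA or RNA codon by codon; '*' marks a stop codon."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # drop an incomplete final codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


toy = "ATGGCCTAA"
print(complement(toy))
print(reverse_complement(toy))
print(transcribe(toy))
print(translate(transcribe(toy)))
```

The `Seq` methods are preferable in practice because they also handle ambiguous bases and alternative codon tables, which this sketch does not.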

`Seq` objects store sequence information. `SeqRecord` objects store the metadata associated with an individual `Seq` object, including an identifier (id), a name, and a description. We'll see more of these as we progress.

## Parsers
An important task in bioinformatics is 'parsing' files in different formats. There are many file formats for biological data; some are human readable and some are not. Many contain thousands or millions of records that you may need to iterate over in order to analyze them. Thankfully, Biopython provides functions in the `Bio.SeqIO` module to parse common sequence file formats.

You've already been introduced to the FASTA format earlier today. FASTA files can contain many unique sequences within a single file. Using biopython, we can parse a FASTA file into individual `SeqRecord` objects and then iteratively do something with them.

In [None]:
from Bio import SeqIO # notice we are importing a different sub module called SeqIO

fasta_file = SeqIO.parse("test.fa",'fasta')

# Let's access the `Seq` object contained within each returned `SeqRecord` and print that out.
for record in fasta_file:
    print(record.seq)
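
To make concrete what a FASTA parser does, the following is a minimal pure-Python sketch that reads records one at a time, as a generator. It is not how `Bio.SeqIO` is implemented: `parse_fasta` and the toy data are hypothetical names introduced for illustration, and the sketch assumes well-formed input where every record starts with a `>` header line.

```python
from io import StringIO


def _make_record(header, chunks):
    """Split a header into (id, description) and join the sequence lines."""
    identifier = header.split(None, 1)[0] if header else ""
    # The id is the first word of the header; the description is the whole header.
    return identifier, header, "".join(chunks)


def parse_fasta(handle):
    """Yield (id, description, sequence) tuples lazily from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)  # previous record is complete
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)  # emit the final record


toy_fasta = StringIO(
    ">seq1 first toy record\n"
    "ACGT\n"
    "ACGT\n"
    ">seq2 second toy record\n"
    "GGCC\n"
)

for identifier, description, sequence in parse_fasta(toy_fasta):
    print(identifier, len(sequence), sequence)
```

Because the function is a generator, only one record is held in memory at a time, which is the same reason iterating over a parser scales to files with millions of sequences.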

When we use this and other parsers, they return objects of a relevant class. In the case of `SeqRecord` objects, additional properties are parsed and stored:

In [None]:
# Reconnect and parse the 'test.fa' file.
# Biopython < 1.78 also accepted alphabet=generic_dna (imported from Bio.Alphabet) here to mark the records as DNA.
# Bio.Alphabet was removed in Biopython 1.78, so the argument is omitted.
fasta_file = SeqIO.parse("test.fa", 'fasta')

# Let's grab the first record
a = next(fasta_file)

print(a)

print(a.id)
print(len(a))
print(a.description)

# To see a full list of available attributes (not all of them may be populated) we can use help()
help(a)


And we can export or print the object in a variety of common biological formats (the formats supported by [Bio.SeqIO](https://biopython.org/wiki/SeqIO)). This makes for a very convenient way to convert sequence records between formats.

In [None]:
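# Biopython >= 1.78 needs a molecule type in the annotations to write EMBL output
a.annotations["molecule_type"] = "DNA"
print(a.format('tab'))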
print(a.format('embl'))
print(a.format('stockholm'))
print(a.format('clustal'))
print(a.format('fasta'))

We can manipulate and slice `SeqRecord` objects as well. This will return a _new_ `SeqRecord` object.

In [None]:
a[:5] # retrieve the first 5 bases as a new SeqRecord

You can combine (concatenate) `SeqRecord` objects together as well.

In [None]:
b = next(fasta_file)

print(b)

new_seq = a + b

print(new_seq)
print(len(new_seq))
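
Slicing and concatenation both return new objects and leave the originals untouched. The toy class below sketches that idea in plain Python. `MiniRecord` is a hypothetical stand-in written for this example, not Biopython's `SeqRecord`, and its rule for merging metadata (keep a field only when both records agree on it) is a design choice of this sketch.

```python
from dataclasses import dataclass

UNKNOWN = "<unknown>"


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class MiniRecord:
    seq: str
    id: str = UNKNOWN
    description: str = UNKNOWN

    def __len__(self):
        return len(self.seq)

    def __getitem__(self, index):
        """A slice gives a new record; a single index gives one letter."""
        if isinstance(index, slice):
            return MiniRecord(self.seq[index], self.id, self.description)
        return self.seq[index]

    def __add__(self, other):
        """Concatenate sequences; keep metadata only where both records agree."""
        new_id = self.id if self.id == other.id else UNKNOWN
        new_desc = (
            self.description if self.description == other.description else UNKNOWN
        )
        return MiniRecord(self.seq + other.seq, new_id, new_desc)


a = MiniRecord("ACGTACGT", "seq1", "first toy record")
b = MiniRecord("GGCC", "seq2", "second toy record")

print(a[:5])       # a new, shorter record with the same metadata
print(a + b)       # a new record holding both sequences
print(len(a + b))  # combined length
print(a)           # the original record is unchanged
```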

## Querying NCBI datasets
The National Center for Biotechnology Information (NCBI) maintains several useful data repositories and databases for biological data. [Entrez](https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html) is a web-based data retrieval system that provides access to the NCBI databases. Biopython provides tools to access and retrieve data through Entrez from within your programs or scripts. Let's see some examples of how we can use these tools. First, let's get some information about which databases are available to query.

In [None]:
from Bio import Entrez

# we must identify ourselves to the Entrez service in order to access it
Entrez.email = "loyalgoff@jhmi.edu"

# make an initial connection in the form of a handle
handle = Entrez.einfo()

result = handle.read() # similar to reading a file

handle.close() # close the handle when you are done

print(result)


The results are returned (by default) in XML format. But we can use Entrez's parser `Entrez.read` to read the handle and parse the XML directly into a more Python-friendly, dictionary-like object.

In [None]:
# make a new connection to Entrez
handle = Entrez.einfo()

result = Entrez.read(handle)

result.keys() # returns a list of keys for the dictionary
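
To give a feel for what parsing the XML into a dictionary involves, here is a small sketch using only the standard library. It is not how `Entrez.read` works internally. The XML string is a hand-written, shortened stand-in for an EInfo response (an assumption of this example, not a captured reply), and `xml_to_dict` is a hypothetical helper that only handles this flat, two-level layout.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an EInfo reply; a real reply lists many more databases.
EINFO_XML = """
<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nucleotide</DbName>
  </DbList>
</eInfoResult>
"""


def xml_to_dict(xml_text):
    """Map each child of the root element to the list of its children's text."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: [item.text for item in child] for child in root}


result = xml_to_dict(EINFO_XML)
print(list(result.keys()))
print(result["DbList"])
```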


It seems this record has only one dictionary key, called 'DbList'. Let's see what's inside.

In [None]:
result['DbList']

This is a list of all of the available databases which we can query with Entrez using the Biopython interface. We can learn more about each of these databases by using EInfo again, this time passing the database name.

In [None]:
handle = Entrez.einfo(db="nucleotide")

result = Entrez.read(handle)

print(result["DbInfo"].keys())

print(result["DbInfo"]["Description"])
print(result["DbInfo"]["LastUpdate"])

The `FieldList` element is a list of possible search fields for this database. Let's dig a bit deeper.

In [None]:
for field in result['DbInfo']['FieldList']:
    print(f'{field["Name"]}\t {field["FullName"]}\t {field["Description"]}')

Now try getting information about another NCBI database through `Entrez.einfo()`.

### Searching for gene information and ids
Now we can use this information to query the nucleotide database at NCBI, using the above fields to construct our search. Let's see if we can find records associated with the human gene 'Notch1' using `esearch()`.

In [None]:
handle = Entrez.esearch(
                db = 'nucleotide',
                term = "Homo[Orgn] AND NOTCH1[Gene]"
    )

result = Entrez.read(handle)

print(result)

print(result['Count'])
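
Behind the scenes, a search like this becomes a request to NCBI's E-utilities web service. The sketch below only builds the request URL and makes no network call, so you can see how the `db` and `term` arguments are encoded. `eutils_url` is a hypothetical helper written for this example, not part of Biopython, and `retmax` (the maximum number of IDs to return) is included as an example of an optional parameter.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Build the URL for an E-utilities tool such as 'esearch' or 'einfo'."""
    # urlencode escapes spaces and brackets so the term is safe inside a URL
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"


url = eutils_url(
    "esearch",
    db="nucleotide",
    term="Homo[Orgn] AND NOTCH1[Gene]",
    retmax=5,
)
print(url)
```

Using `Entrez.esearch` rather than building URLs by hand is still the better choice, since Biopython also attaches your email address and paces requests for you.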

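At the time of writing, this search identified 24 records associated with this gene; the count will change as NCBI adds records. The `IdList` entry holds the Entrez IDs associated with these records (by default `esearch` returns at most the first 20 IDs). We will use these to 'fetch' the full records for the first few IDs.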

In [None]:
notch1_ids = result['IdList']
print(str(notch1_ids))

In [None]:
handle = Entrez.efetch(db="nucleotide", id=notch1_ids[:2], rettype="fasta", retmode="text") # here we are specifically requesting these records in a fasta format

results = SeqIO.parse(handle,'fasta') # and now we can pass this handle to Bio.SeqIO.parse to parse the handle into SeqRecord objects

# and now we can loop through the list of 'SeqRecord' objects
for res in results:
    print(res.format('fasta'))

Of course there are a number of different 'pythonic' ways in which you can achieve this as well.  For example:

In [None]:
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="1389542951"
) as handle:
    record = SeqIO.read(handle, "gb")
print(f'Record {record.id} contains {len(record.features)} features')

In [None]:
# Let's take a look at some of these features...

for feature in record.features[:20]:
    print(feature)
    
type(record.features[0])

## A real-world example
Let's try a common exercise in sequence analysis: comparing two sequences for similarity. To do this, we will introduce you to a dot plot, which is a visual way of comparing sequences to each other. Let's compare the sequence of the same gene between mouse and human.

In [None]:
# Pick a gene
gene = "Hoxc6"


# Search for human records
handle = Entrez.esearch(
                db = 'nucleotide',
                term = f"Homo[Orgn] AND {gene}[Gene]"
    )

hum = Entrez.read(handle)

humID = hum['IdList'][0]

# Search for mouse records
handle = Entrez.esearch(
                db = 'nucleotide',
                term = f"Mus[Orgn] AND {gene}[Gene]"
    )

mus = Entrez.read(handle)

musID = mus['IdList'][0]

#Retrieve full human record
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id=humID
) as handle:
    hRecord = SeqIO.read(handle, "gb")
    
#Retrieve full mouse record
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id=musID
) as handle:
    mRecord = SeqIO.read(handle, "gb")


In [None]:
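# Calculate the similarity between the two sequences over a certain window size
window = 10
hSeq = str(hRecord.seq)
mSeq = str(mRecord.seq)
# Rows (i) index the mouse sequence and columns (j) the human sequence, matching the axis labels used in the plot
data = [[(mSeq[i:i+window] == hSeq[j:j+window]) for j in range(len(hSeq)-window+1)] for i in range(len(mSeq)-window+1)]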
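
The windowed comparison is easier to follow on sequences short enough to read by eye. The sketch below applies the same idea to two toy strings and prints the result as text, with `#` marking a window match. `dot_matrix` and the toy sequences are hypothetical names introduced for illustration, and exact window matching is assumed (no scoring of near matches).

```python
def dot_matrix(row_seq, col_seq, window):
    """Return a grid where cell [i][j] is True if the windows starting at
    row_seq[i] and col_seq[j] are identical."""
    n_rows = len(row_seq) - window + 1
    n_cols = len(col_seq) - window + 1
    return [
        [row_seq[i:i + window] == col_seq[j:j + window] for j in range(n_cols)]
        for i in range(n_rows)
    ]


toy_rows = "GATTACA"       # plotted down the side
toy_cols = "TTGATTACAGG"   # plotted across the top

for row in dot_matrix(toy_rows, toy_cols, window=3):
    print("".join("#" if hit else "." for hit in row))
```

A shared stretch of sequence shows up as a diagonal run of matches, and its horizontal offset tells you where the shared region starts in each sequence. Larger windows suppress chance matches at the cost of missing short or imperfect similarities.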

In [None]:
%matplotlib inline 
import seaborn as sns
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12,8)

ax = sns.heatmap(data)
ax.set(xlabel=f'Human {gene}',ylabel=f'Mouse {gene}',title=f"{gene} Dot Plot - Mouse vs Human")
plt.show()