# Biopython

[Biopython](https://www.biopython.org) is a well-maintained collection of modules, tools, and scripts designed to help with common bioinformatics tasks. It is free and open source, and a large community of contributors maintains the package.

## Sequences and Sequence objects
Earlier, we created and manipulated DNA sequences as strings. A core feature of Biopython is a set of 'classes' that store and manipulate particular kinds of biological data. One of these is the `Seq` class, which is designed to hold the information needed for a biological sequence (DNA, RNA, protein, etc.). In older Biopython releases a biological sequence has two primary components: the sequence itself and the 'Alphabet' that defines what the characters in the sequence represent (e.g. [A,G,C,T] for DNA or [A,G,C,U] for RNA).

Let's start by creating a 'generic' sequence (`Seq` object) in Biopython.

In [22]:
from Bio.Seq import Seq

mySeq = Seq("TTGAGCAGTTGAGCGATTATGTCGATTAGCTGAGGCATTCGAGCTATG")

print(mySeq)
print(type(mySeq))

#str(mySeq)

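# Note: the .alphabet attribute was removed in Biopython 1.78;
# this line only works on older releases.
print(mySeq.alphabet)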

TTGAGCAGTTGAGCGATTATGTCGATTAGCTGAGGCATTCGAGCTATG
<class 'Bio.Seq.Seq'>
Alphabet()


So why go to all of this trouble to create a `Seq` object instead of just storing the DNA sequence as a string? The `Seq` object defines a number of methods (functions) that implement useful _biological_ operations:

In [28]:
print(mySeq)
print(mySeq.complement())
print(mySeq.reverse_complement())
print(mySeq.transcribe())
print(mySeq.transcribe().translate())
print(mySeq.count("TT"))

TTGAGCAGTTGAGCGATTATGTCGATTAGCTGAGGCATTCGAGCTATG
AACTCGTCAACTCGCTAATACAGCTAATCGACTCCGTAAGCTCGATAC
CATAGCTCGAATGCCTCAGCTAATCGACATAATCGCTCAACTGCTCAA
UUGAGCAGUUGAGCGAUUAUGUCGAUUAGCUGAGGCAUUCGAGCUAUG
LSS*AIMSIS*GIRAM
5
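
To make the operations above less of a black box, here is a minimal pure-Python sketch of what `complement()`, `reverse_complement()`, and `transcribe()` compute for a plain DNA string. The helper names are hypothetical and this is not how Biopython implements them (it assumes only the unambiguous uppercase bases A, C, G, T), but it should reproduce the output shown above.

```python
# Illustrative sketch only: plain-string versions of three Seq methods.
# Assumes uppercase, unambiguous DNA (A, C, G, T); not Biopython's own code.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Rewrite the coding strand as RNA by replacing T with U."""
    return dna.replace("T", "U")


dna = "TTGAGCAGTTGAGCGATTATGTCGATTAGCTGAGGCATTCGAGCTATG"
print(complement(dna))
print(reverse_complement(dna))
print(transcribe(dna))
```

In practice the `Seq` methods are preferable, since they also handle ambiguity codes and RNA or protein sequences.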


`Seq` objects store sequence information, while `SeqRecord` objects store the metadata associated with an individual `Seq` object, including an identifier (id), a name, and a description. We'll see more of these as we progress.
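
As a quick sketch of how a `SeqRecord` wraps a `Seq`, the example below builds one by hand. The id, name, and description values are made up for illustration; records are more commonly produced by a parser than constructed like this.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# The id, name, and description here are hypothetical placeholders.
record = SeqRecord(
    Seq("TTGAGCAGTTGAGCGATTATG"),
    id="example_1",
    name="example",
    description="hypothetical record for illustration",
)

print(record.id)           # the identifier
print(record.description)  # free-text description
print(record.seq)          # the underlying Seq object
print(len(record.seq))     # sequence length
```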

## Parsers
An important task in bioinformatics is 'parsing' files of different formats. There are many file formats for biological data; some are human readable and some are not. A single file can contain thousands or millions of records that you need to iterate over in order to analyze them. Thankfully, Biopython provides several functions in the `Bio.SeqIO` module to parse common biological file formats.

You were introduced to the FASTA format earlier today. A single FASTA file can contain many sequences. Using Biopython, we can parse a FASTA file into individual `SeqRecord` objects and then do something with each one in turn.

In [30]:
from Bio import SeqIO

fasta_file = SeqIO.parse("test.fa",'fasta')

for record in fasta_file:
    print(record.seq)

TGGCGTTGTCTTTAATTCGATTAGCATTGCATATCGATTATCTAGCGATATGCTATGCTTAGC
AGCCATATGTATCGAGCGATATGCGATACGAGTATCGAGTATGCAGTATGCATTGCAGTATGC
TGGGTATGCGCGACGAGCATTATACGATGCATTATTACGGATCTACGGCGATATTACGTACGA
GGACGTATCGAGTCTAGCGAGCGACTTCGAGCGATATCGGACTCTCGTCCTCTTCAGTCAGCC
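
Each parsed record carries more than its sequence. The sketch below parses a small in-memory FASTA string (the two records and their headers are made up, since the contents of `test.fa` beyond the sequences are not shown here) and prints each record's id, length, and description.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records, held in memory instead of a file on disk.
fasta_text = """>seq1 first made-up record
ATGCGTACGTTAG
>seq2 second made-up record
GGCATTCGAGC
"""

# SeqIO.parse returns an iterator; wrapping it in list() keeps the
# records around so they can be reused after the first pass.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    print(record.id, len(record.seq), record.description)
```

Note that the iterator returned by `SeqIO.parse` can only be consumed once, so store the records in a list (as above) if you need to loop over them more than once.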


In [1]:
### Motif analysis

In [None]:
### Restriction digestion

## Parsing biology-formatted files

## Querying NCBI datasets

## Motif analysis

## Example(s) in the real world

In [None]:
#Pick your favorite gene
