# 4. Introduction to Biopython

Biopython is a set of freely available tools for biological computation written in Python. It is a distributed collaborative effort to develop Python libraries and applications that address the needs of current and future work in bioinformatics. Before diving into your (biological) data and analyzing it with complex self-written scripts, it makes sense to search the [Biopython documentation](http://biopython.org/DIST/docs/tutorial/Tutorial.html) for out-of-the-box solutions that are already part of Biopython.  

![Logo biopython](img/logo_biopython.PNG)

In the next few chapters, we'll learn some of Biopython's most frequently used functionalities. There are several ways of installing Biopython. The complete package can be installed:
- Using Anaconda's environments and searching for the package, or
- Immediately in a Notebook using the following code:

In [1]:
# pip install biopython 

# Import the Biopython library
import Bio

If there was no error doing this, then you're probably good to go. You can confirm this by asking Python which version you have installed. 

In [2]:
# Check the version to confirm a proper installation (v1.74)
print(Bio.__version__)

1.74
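
If you'd rather check for the package without triggering an `ImportError`, the standard library can look it up for you. The sketch below is one possible way to do that; the helper name `is_installed` is made up for this example and is not part of Biopython.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "Bio" is the import name of the Biopython package
biopython_available = is_installed("Bio")
print("Biopython importable:", biopython_available)
```

If this prints `False`, install the package first (for example with the `pip install biopython` line shown above) and restart the kernel.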


It often makes sense to import specific classes or submodules of Biopython directly, which makes them easier to use. For example, if you want to work with sequences, you can import the `Seq` object in the following way. This lets you work directly with the `Seq` (sequence) object. 

In [5]:
from Bio.Seq import Seq

In [12]:
help(Seq)

Help on class Seq in module Bio.Seq:

class Seq(builtins.object)
 |  Seq(data, alphabet=Alphabet())
 |  
 |  Read-only sequence object (essentially a string with an alphabet).
 |  
 |  Like normal python strings, our basic sequence object is immutable.
 |  This prevents you from doing my_seq[5] = "A" for example, but does allow
 |  Seq objects to be used as dictionary keys.
 |  
 |  The Seq object provides a number of string like methods (such as count,
 |  find, split and strip), which are alphabet aware where appropriate.
 |  
 |  In addition to the string like sequence, the Seq object has an alphabet
 |  property. This is an instance of an Alphabet class from Bio.Alphabet,
 |  for example generic DNA, or IUPAC DNA. This describes the type of molecule
 |  (e.g. RNA, DNA, protein) and may also indicate the expected symbols
 |  (letters).
 |  
 |  The Seq object also provides some biological methods, such as complement,
 |  reverse_complement, transcribe, back_transcribe and translate 
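
To make the help text a bit more concrete, here is a small sketch of the string-like behaviour it describes: counting, searching, immutability and use as a dictionary key. The sequence is the same short example used later in this chapter, and the variable names are only for illustration.

```python
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")

# String-like methods
n_a = my_seq.count("A")        # number of "A" letters
first_act = my_seq.find("ACT") # index of the first "ACT", or -1 if absent
print(n_a)        # 3
print(first_act)  # 5

# Seq objects are immutable: item assignment is rejected
try:
    my_seq[5] = "G"
    assignment_worked = True
except TypeError:
    assignment_worked = False
print(assignment_worked)  # False

# Because they are immutable, Seq objects can serve as dictionary keys
lookup = {my_seq: "demo sequence"}
print(lookup[Seq("AGTACACTGGT")])  # demo sequence
```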

**Note**: The `Bio.SeqIO` and `Bio.AlignIO` functions `parse`, `read` and `write` accept either a filename or an open file handle (only very old Biopython releases required a handle). If you open a handle yourself, it is especially important to remember to close output handles explicitly after writing your data.
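
One common way to make sure a handle gets closed is a `with` block, which closes the file as soon as the block ends. The following is a minimal sketch, assuming a made-up record and a temporary file path (neither comes from the course data):

```python
import os
import tempfile

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# A made-up record, only for this example
record = SeqRecord(Seq("AGTACACTGGT"), id="demo1",
                   description="made-up example record")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "demo.fasta")

    # The output handle is closed automatically when the block ends
    with open(out_path, "w") as handle:
        n_written = SeqIO.write(record, handle, "fasta")
    handle_closed = handle.closed

    # Read the single record back through a handle
    with open(out_path) as handle:
        read_back = SeqIO.read(handle, "fasta")

print(n_written)           # 1
print(handle_closed)       # True
print(str(read_back.seq))  # AGTACACTGGT
```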

Biopython has many ways of working with sequence data. Today we'll have a look at the following components:
- Working with sequences: `Seq` and `Alphabet`,
- Sequence annotations with: `SeqRecord` objects,
- Reading, writing and parsing files with: `SeqIO`,
- Querying NCBI with: `Entrez`


# 2 Quick Start - What can you do with Biopython?

This section previews a few features of Biopython as a summary of what the following chapters cover in more detail. 

## 2.1 Working with sequences

We'll start with a quick introduction to Biopython's mechanism for dealing with sequences: the `Seq` object. Most of the time when we think about sequences, we have in mind a string of letters like `AGTACACTGGT`.

In [None]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
my_seq

In [None]:
print(my_seq)

In [None]:
my_seq.alphabet

This is a sequence object with a *generic* alphabet: we have not specified whether it is a DNA sequence or (less likely) a protein sequence. 

In [None]:
my_seq

In [None]:
my_seq.complement()

In [None]:
my_seq.reverse_complement()
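
The cells above are shown without their output, so here is a short sketch of what these methods are expected to return for `AGTACACTGGT`. Converting to `str` keeps the result independent of how a particular Biopython version prints `Seq` objects.

```python
from Bio.Seq import Seq

my_seq = Seq("AGTACACTGGT")

# Each base replaced by its partner (A<->T, C<->G)
complement = str(my_seq.complement())
# The complement read in the opposite direction
reverse_complement = str(my_seq.reverse_complement())
# DNA to RNA: T is replaced by U
rna = str(my_seq.transcribe())

print(complement)          # TCATGTGACCA
print(reverse_complement)  # ACCAGTGTACT
print(rna)                 # AGUACACUGGU
```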

The next most important class is the `SeqRecord` or Sequence Record. This holds a sequence (as a `Seq`
object) with additional annotation including an identifier, name and description. The `Bio.SeqIO` module
for reading and writing sequence file formats works with `SeqRecord` objects.
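
As a minimal sketch of what such a record looks like, the example below builds a `SeqRecord` by hand. The identifier, name, description and annotation are all made up for illustration; real records usually come from `Bio.SeqIO`.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("AGTACACTGGT"),
    id="demo1",                            # made-up identifier
    name="demo",                           # made-up short name
    description="made-up example record",  # free-text description
)

# Extra information can be stored in the annotations dictionary
record.annotations["source"] = "hand-typed example"

# Render the record as FASTA text
fasta_text = record.format("fasta")
print(fasta_text)
```

The first line of `fasta_text` combines the identifier and the description after the ">" symbol, and the sequence follows on the next line.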

## 2.4 Parsing sequence file formats
This case study is about orchids. Two files were downloaded that will be used repeatedly in this tutorial. The FASTA file has 94 records, each starting with a line that begins with the ">" symbol, followed by the sequence on one or more lines. 

In [3]:
from Bio import SeqIO
for seq_record in SeqIO.parse("data/ls_orchid.fasta","fasta"):
    print(seq_record.id)         # print the identifier of the record
    print(repr(seq_record.seq))  # print a short representation of the sequence
    print(len(seq_record))       # print the length of the sequence

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GC
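
If you don't have the orchid file at hand, the same parsing loop can be tried on FASTA text held in memory. This is a minimal sketch: the two records below are made up for the example (they are not taken from `ls_orchid.fasta`), and `io.StringIO` simply makes a string behave like an open file.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up records; the second sequence is split over two lines
fasta_text = """>demo1 first made-up record
AGTACACTGGT
>demo2 second made-up record
CGTAACAAGG
TTTCCGTAGG
"""

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

# Map each record identifier to its sequence length
lengths = {rec.id: len(rec) for rec in records}
print(lengths)  # {'demo1': 11, 'demo2': 20}
```

Note that the sequence lines of `demo2` are joined into one sequence of 20 letters, which is how the multi-line records in the orchid file are handled as well.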

## 2.5 Connecting with biological databases
One of the very common things you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive
work to do.  
Biopython has code to extract information from the following databases:
- Entrez: molecular biology database system that provides integrated access to nucleotide and protein sequence data, gene-centered and genomic mapping information, 3D structure data, PubMed MEDLINE, and more.
- ExPASy: SIB Bioinformatics Resource Portal - proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc.
- SCOP: Structural Classification of Proteins - structural and evolutionary relationships between all proteins whose structure is known
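
As an illustration of the Entrez route, the sketch below wraps a single download in a small function. `Entrez.efetch` and `SeqIO.read` are Biopython calls; the function name `fetch_fasta_record` and its parameters are made up for this example. The function is only defined here, not called, because calling it needs an internet connection and a real e-mail address (NCBI asks for one with every request).

```python
from Bio import Entrez, SeqIO


def fetch_fasta_record(accession, email):
    """Download one nucleotide record from NCBI and return it as a SeqRecord."""
    Entrez.email = email  # NCBI uses this to contact you in case of problems
    handle = Entrez.efetch(db="nucleotide", id=accession,
                           rettype="fasta", retmode="text")
    try:
        return SeqIO.read(handle, "fasta")
    finally:
        handle.close()  # always close the network handle
```

A call could look like `fetch_fasta_record("Z78533", "you@example.org")`, where `Z78533` is the accession of the first orchid record shown earlier and the e-mail address is a placeholder for your own.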

In [1]:
from Bio.PDB import *

In [2]:
pdbl = PDBList()
pdbl.retrieve_pdb_file("1HHO")



Downloading PDB structure '1HHO'...


'C:\\Users\\tuurm\\OneDrive - VIB\\Trainingen\\Python for downstream data-analysis\\Autumn2020\\hh\\1hho.cif'

In [3]:
parser = MMCIFParser()
struct = parser.get_structure("1HHO", "hh/1hho.cif")
struct



<Structure id=1HHO>

In [6]:
model = struct[0]
model

<Model id=0>

In [7]:
for chain in model: 
    print(chain)

<Chain id=A>
<Chain id=B>


In [9]:
pip install nglview

Collecting nglview
  Downloading https://files.pythonhosted.org/packages/6d/21/1e1b26979a11e8b812dfc564c525818f69650e78f104260ec96b07beb405/nglview-2.7.7.tar.gz (5.2MB)
Building wheels for collected packages: nglview
  Building wheel for nglview (setup.py): started
  Building wheel for nglview (setup.py): finished with status 'done'
  Stored in directory: C:\Users\tuurm\AppData\Local\pip\Cache\wheels\24\14\56\fc0406301850d9d017a748b568b2e97566148deda046ac56dc
Successfully built nglview
Installing collected packages: nglview
Successfully installed nglview-2.7.7
Note: you may need to restart the kernel to use updated packages.


In [10]:
# 3D visualization
import nglview as nv



NameError: name 'NGLWidget' is not defined

In [20]:
view = nv.show_biopython(struct)

In [18]:
view

NGLWidget()

In [22]:
pip install py3Dmol

Collecting py3Dmol
  Downloading https://files.pythonhosted.org/packages/be/49/351994057806d6076d22b51215d2b18b1e332b646f1fe8719907aa0b12ff/py3Dmol-0.8.1-py2.py3-none-any.whl
Installing collected packages: py3Dmol
Successfully installed py3Dmol-0.8.1
Note: you may need to restart the kernel to use updated packages.


In [23]:
import py3Dmol


In [26]:
view1 = py3Dmol.view(query='pdb:1HHO')


In [27]:
view1.setStyle({'cartoon':{'color':'spectrum'}})
view1

<py3Dmol.view at 0x1e29b104f98>

In [29]:
#!/bin/python3

# Explicit imports instead of a wildcard, so it is clear where each name comes from
from Bio.PDB import PDBList, MMCIFParser, Superimposer
import matplotlib.pyplot as plt
# Importing Axes3D registers the '3d' projection. Matplotlib versions before 3.2
# raise "ValueError: Unknown projection '3d'" without this import.
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401

# Download the structure (the download is skipped if the file already exists)
pdbl = PDBList()
pdbl.retrieve_pdb_file("1HHO")

# Parse the mmCIF file into a Structure object
parser = MMCIFParser()
struct = parser.get_structure("1HHO", "hh/1hho.cif")

# Part 1: Plot a structure
# Collect the C-alpha coordinates of chain A in the first model
calphas = [res["CA"].get_coord() for res in struct[0]["A"] if "CA" in res]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot([xyz[0] for xyz in calphas],
        [xyz[1] for xyz in calphas],
        [xyz[2] for xyz in calphas], color='skyblue')
ax.view_init(30, 185)
plt.show()

# Part 2: Plot an aligned pair of structures
chain_A = [res["CA"] for res in struct[0]["A"] if "CA" in res]
# Keep only the first 141 C-alpha atoms of chain B so that both lists have
# the same length, as Superimposer.set_atoms requires
chain_B = [res["CA"] for res in struct[0]["B"] if "CA" in res][:141]

# Fit chain B onto chain A, then move the chain B atoms accordingly
sup = Superimposer()
sup.set_atoms(chain_A, chain_B)
sup.apply(chain_B)

calphas_A = [atom.get_coord() for atom in chain_A]
calphas_B = [atom.get_coord() for atom in chain_B]


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot([xyz[0] for xyz in calphas_A],
        [xyz[1] for xyz in calphas_A],
        [xyz[2] for xyz in calphas_A], color='green')
ax.plot([xyz[0] for xyz in calphas_B],
        [xyz[1] for xyz in calphas_B],
        [xyz[2] for xyz in calphas_B], color='red')
ax.view_init(30, 185)
plt.show()



Structure exists: 'C:\Users\tuurm\OneDrive - VIB\Trainingen\Python for downstream data-analysis\Autumn2020\hh\1hho.cif' 
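
The `Superimposer` step in Part 2 looks for the rotation and translation that best overlay one set of atoms on another (a least-squares fit). As a rough illustration of the idea without needing a downloaded structure, the sketch below uses `Bio.SVDSuperimposer` directly on plain NumPy coordinates. The four points, the rotation and the shift are all made up for this example.

```python
import numpy as np
from Bio.SVDSuperimposer import SVDSuperimposer

# Four made-up reference points (one row per point: x, y, z)
fixed = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

# Create a second copy that is rotated 90 degrees about the z axis and shifted
rotation = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
shift = np.array([5.0, -2.0, 3.0])
moving = fixed @ rotation.T + shift

# Find the transformation that puts `moving` back on top of `fixed`
sup = SVDSuperimposer()
sup.set(fixed, moving)
sup.run()

rms = sup.get_rms()             # root-mean-square deviation after the fit
fitted = sup.get_transformed()  # `moving` after applying the fitted transformation

print("RMSD after fitting:", round(float(rms), 6))
```

Because the second set of points is an exact rigid-body copy of the first, the RMSD after fitting should be essentially zero. For two real protein chains, such as chains A and B above, it will be larger, since the chains are similar but not identical.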