# Today's session

In today's session, we will practice with some of the genomics tools introduced in the theoretical lectures. In particular, we will see how to work with sequences, align them, and identify motifs. To do this, we will use `Biopython`, a comprehensive Python package that supports these and many other tasks. For a complete tour of `Biopython`, take a look at the [complete tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html).

In the final exercise for this class you will be working on the coronavirus.

# What is Biopython?

The Biopython Project is an international association of developers of freely available Python tools for computational molecular biology.

![Biopython](Media/biopython_logo.png)


The main Biopython releases have lots of functionality, including:

* The ability to parse bioinformatics files into data structures that Python can work with, including support for the following formats:
    - Blast output – both from standalone and WWW Blast
    - Clustalw
    - FASTA
    - GenBank
    - PubMed and Medline
    - ExPASy files, like Enzyme and Prosite
    - SCOP, including ‘dom’ and ‘lin’ files
    - UniGene
    - SwissProt 
* Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
* Code to deal with popular on-line bioinformatics destinations such as:
    - NCBI – Blast, Entrez and PubMed services
    - ExPASy – Swiss-Prot and Prosite entries, as well as Prosite searches 
* Interfaces to common bioinformatics programs such as:
    - Standalone Blast from NCBI
    - Clustalw alignment program
    - EMBOSS command line tools 
* A standard sequence class that deals with sequences, ids on sequences, and sequence features.
* Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
* Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
* Code making it easy to split up parallelizable tasks into separate processes.
* GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
* Extensive documentation and help with using the modules, including the tutorial, on-line wiki documentation, the web site, and the mailing list.
* Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.
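
As a small illustration of the sequence operations listed above, the sketch below uses Biopython's `Seq` class to transcribe, translate, and reverse-complement a short DNA string. The sequence is an arbitrary illustrative example, not part of the course data, and the variable names are our own.

```python
from Bio.Seq import Seq

# Arbitrary illustrative coding sequence (not taken from the course data)
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

messenger_rna = coding_dna.transcribe()            # DNA -> RNA (T becomes U)
protein = coding_dna.translate()                   # standard genetic code; '*' marks a stop codon
reverse_strand = coding_dna.reverse_complement()   # complementary strand, read 5' to 3'

print(messenger_rna)
print(protein)
print(reverse_strand)
```

The same methods are available on the `.seq` attribute of any record parsed with `SeqIO`.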

In [5]:
from Bio import SeqIO

In [6]:
records = list(SeqIO.parse('Data/sequence.gb', 'genbank'))
for record in records:
    print(record.description)

Middle East respiratory syndrome-related coronavirus isolate NL13892, complete genome
Middle East respiratory syndrome-related coronavirus isolate NL140455, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/01/human/2020/SWE, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/WH-09/human/2020/CHN, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS0CoV-2/61-TW/human/2020/ NPL, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/NTU01/2020/TWN, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/NTU02/2020/TWN, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV/USA-IL2/2020, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV/USA-CA6/2020, complete genome
Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/Yunnan-01/human/2020/CHN, complete genome
Severe acute re

Human coronavirus OC43 strain OC43/human/USA/912-6/1991, complete genome
Human coronavirus OC43 strain OC43/human/USA/911-38/1991, complete genome
Human coronavirus OC43 strain OC43/human/USA/9211-43/1992, complete genome
Human coronavirus OC43 strain OC43/human/USA/965-6/1996, complete genome
Human coronavirus OC43 strain OC43/human/USA/971-5/1997, complete genome
Human coronavirus NL63 strain NL63/human/USA/904-20/1990, complete genome
Human coronavirus NL63 strain NL63/human/USA/012-31/2001, complete genome
Human coronavirus NL63 strain NL63/human/USA/8712-17/1987, complete genome
Human coronavirus NL63 strain NL63/human/USA/911-56/1991, complete genome
Human coronavirus NL63 strain NL63/human/USA/891-6/1989, complete genome
Human coronavirus NL63 strain NL63/human/USA/903-28/1990, complete genome
Human coronavirus NL63 strain NL63/human/USA/838-9/1983, complete genome
Human coronavirus NL63 strain NL63/human/USA/901-24/1990, complete genome
Human coronavirus NL63 strain NL63/human/

In [7]:
records[0]

SeqRecord(seq=Seq('GATTTAAGTGAATAGCCTAGCTATCTCACCCCCTCTCGTTCTCTTGCAGAACTC...AAA', IUPACAmbiguousDNA()), id='MG987420.1', name='MG987420', description='Middle East respiratory syndrome-related coronavirus isolate NL13892, complete genome', dbxrefs=[])

In [8]:
records[0].annotations

{'accessions': ['MG987420'],
 'data_file_division': 'VRL',
 'date': '23-FEB-2020',
 'gi': '1386872237',
 'keywords': [''],
 'molecule_type': 'RNA',
 'organism': 'Middle East respiratory syndrome-related coronavirus',
 'references': [Reference(title='Discovery of novel bat betacoronaviruses in south China', ...),
  Reference(title='Direct Submission', ...)],
 'sequence_version': 1,
 'source': 'Middle East respiratory syndrome-related coronavirus (MERS-CoV)',
 'structured_comment': OrderedDict([('Assembly-Data',
               OrderedDict([('Assembly Method', 'Lasergene Seqman v. V7.0'),
                            ('Assembly Name', 'Assembly 1'),
                            ('Sequencing Technology',
                             'Illumina; Sanger dideoxy sequencing')]))]),
 'taxonomy': ['Viruses',
  'ssRNA viruses',
  'ssRNA positive-strand viruses, no DNA stage',
  'Nidovirales',
  'Coronaviridae',
  'Coronavirinae',
  'Betacoronavirus'],
 'topology': 'linear'}

In [9]:
# Only `SeqIO` was imported (from Bio import SeqIO), so call it directly
record_dict = SeqIO.to_dict(SeqIO.parse('Data/sequence.gb', 'genbank'))

In [20]:
print(record_dict['MN908947.3'].seq)

ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCT

In [21]:
from Bio.Blast import NCBIWWW

In [22]:
help(NCBIWWW.qblast)

Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None)
    BLAST search using NCBI's

In [29]:
%time result_handle = NCBIWWW.qblast("blastn", "nt", record_dict['MN908947.3'].seq)
with open("Files/my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())


CPU times: user 92 ms, sys: 44 ms, total: 136 ms
Wall time: 10min 14s
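
Since the remote search above took about ten minutes of wall time, it can pay off to reuse the saved file instead of repeating the query. The following is a minimal sketch of that caching pattern in plain Python. `cached_text` and `slow_fetch` are hypothetical names introduced here, and `slow_fetch` is only a stand-in for the real BLAST call.

```python
import os
import tempfile


def cached_text(path, fetch):
    """Return the text stored at `path`; call `fetch()` only if the file is missing."""
    if os.path.exists(path):
        with open(path) as handle:
            return handle.read()
    text = fetch()
    with open(path, "w") as handle:
        handle.write(text)
    return text


calls = []  # records how often the slow function actually runs


def slow_fetch():
    """Hypothetical stand-in for a slow remote query."""
    calls.append(1)
    return "<BlastOutput/>"


cache_path = os.path.join(tempfile.mkdtemp(), "my_blast.xml")
first = cached_text(cache_path, slow_fetch)   # runs slow_fetch and writes the file
second = cached_text(cache_path, slow_fetch)  # reads the file, no second query
print(len(calls))
```

In this notebook, `fetch` could presumably be something like `lambda: NCBIWWW.qblast("blastn", "nt", seq).read()`, with `path` set to `"Files/my_blast.xml"`.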


In [31]:
from Bio.Blast import NCBIXML

In [46]:
# Parse inside a `with` block so the file is closed once all records are read
with open("Files/my_blast.xml") as result_handle:
    blast_records = list(NCBIXML.parse(result_handle))

![UML diagram of the BlastRecord](Media/BlastRecord.png)

In [56]:
E_VALUE_THRESH = 0.01
print(len(blast_records[0].alignments))
for alignment in blast_records[0].alignments:
    for hsp in alignment.hsps:  # HSPs = high-scoring segment pairs
        if hsp.expect < E_VALUE_THRESH:
            print("\n****Alignment****")
            print("sequence  :", alignment.title)
            print("length    :", alignment.length)
            print("e value   :", hsp.expect)
            print("identities:", hsp.identities)
            print(hsp.query[0:75] + "...")
            print(hsp.match[0:75] + "...")
            print(hsp.sbjct[0:75] + "...")

50

****Alignment****
sequence  : gi|1798174254|ref|NC_045512.2| Wuhan seafood market pneumonia virus isolate Wuhan-Hu-1, complete genome >gi|1798172431|gb|MN908947.3| Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
length    : 29903
e value   : 0.0
identities: 29903
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC...

****Alignment****
sequence  : gi|1805293633|gb|MT019531.1| Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome
length    : 29899
e value   : 0.0
identities: 29898
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC...

***
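
The raw identity counts above are easier to compare as percentages. Here is a minimal sketch, in plain Python, that derives a percent identity from a BLAST-style match line, in which `|` marks an identical column. `percent_identity` is a hypothetical helper, and the two ten-letter sequences are made up for illustration.

```python
def percent_identity(match_line):
    """Percentage of aligned columns marked identical ('|') in a match line."""
    if not match_line:
        raise ValueError("empty match line")
    return 100.0 * match_line.count("|") / len(match_line)


# Made-up toy sequences that differ at a single position
query = "ATTAAAGGTT"
sbjct = "ATTAAAGCTT"
match = "".join("|" if q == s else " " for q, s in zip(query, sbjct))

print(match)
print(percent_identity(match))
```

For a real HSP, the same quantity should be obtainable as `100 * hsp.identities / hsp.align_length`. Note that `alignment.length` is the length of the subject sequence, not of the aligned region.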

In [58]:
from Bio import Entrez
Entrez.email = "roger.guimera@urv.cat"  # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="AY395003.1", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

In [60]:
record.seq

Seq('AGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTAGCTGTCGCTCGGCTGCAT...AAA', IUPACAmbiguousDNA())

In [65]:
from Bio import pairwise2
from Bio.SubsMat.MatrixInfo import blosum62

In [62]:
seq1 = record.seq
seq2 = record_dict['MN908947.3'].seq

In [66]:
alignments = pairwise2.align.globalds(seq1, seq2, blosum62, -10, -0.5)

MemoryError: Out of memory
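
A likely reason for the failure above: a global alignment fills a dynamic-programming table with one cell per pair of positions, so two genomes of about 30,000 nucleotides need close to a billion cells. The back-of-the-envelope sketch below makes this concrete. The length of `seq1` and the 8 bytes per cell are assumptions for illustration (Python objects typically take more). The figure 29,903 is the length reported for MN908947.3 in the BLAST output above. Note also that BLOSUM62 is an amino-acid matrix, so it is not a natural scoring choice for nucleotide sequences.

```python
def dp_table_cells(len_a, len_b):
    """Number of cells in a full global-alignment table, including the border row and column."""
    return (len_a + 1) * (len_b + 1)


len_seq1 = 30_000    # assumption: approximate length of the first genome
len_seq2 = 29_903    # length reported for MN908947.3 in the BLAST output
bytes_per_cell = 8   # assumption: one 8-byte number per cell

cells = dp_table_cells(len_seq1, len_seq2)
gib = cells * bytes_per_cell / 2**30

print(f"{cells:,} cells, roughly {gib:.1f} GiB at {bytes_per_cell} bytes per cell")
```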

In [68]:
from Bio import Align

In [69]:
aligner = Align.PairwiseAligner()
alignments = aligner.align(seq1, seq2)

In [74]:
print(aligner)
print(aligner.algorithm)

Pairwise sequence aligner with parameters
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: 0.000000
  target_internal_extend_gap_score: 0.000000
  target_left_open_gap_score: 0.000000
  target_left_extend_gap_score: 0.000000
  target_right_open_gap_score: 0.000000
  target_right_extend_gap_score: 0.000000
  query_internal_open_gap_score: 0.000000
  query_internal_extend_gap_score: 0.000000
  query_left_open_gap_score: 0.000000
  query_left_extend_gap_score: 0.000000
  query_right_open_gap_score: 0.000000
  query_right_extend_gap_score: 0.000000
  mode: global

Needleman-Wunsch
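
To see what a global (Needleman-Wunsch) aligner computes, here is a minimal pure-Python sketch of the score calculation with a linear gap penalty. It is an illustration under simplifying assumptions, not Biopython's implementation: it returns only the score, not the alignment, and keeps just two rows of the table in memory. The function name and the toy sequences are our own. The defaults mirror the parameters printed above.

```python
def global_alignment_score(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Needleman-Wunsch score with a linear gap penalty, storing two rows only."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b` costs only gaps
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # column 0: prefix of `a` against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + substitution  # align a[i-1] with b[j-1]
            up = previous[j] + gap                     # a[i-1] against a gap
            left = current[j - 1] + gap                # b[j-1] against a gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
print(global_alignment_score("ACGT", "AGT", match=1.0, mismatch=-2.0, gap=-2.5))
```

With the default scores, the result is simply the largest number of positions that can be matched, since mismatches and gaps cost nothing.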


In [77]:
from Bio.Align import substitution_matrices

In [84]:
for m in substitution_matrices.load():
    print(m, substitution_matrices.load(m))

BENNER22 #  S.A. Benner, M.A. Cohen, G.H. Gonnet:
#  "Amino acid substitution during functionally constrained divergent evolution
#  of protein sequences".
#  Protein Engineering 7(11): 1323-1332 (1994).
#  Figure 3B.
#  PMID 7700864
     A    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y
A  2.5 -1.2 -0.2 -0.3 -3.1  0.8 -1.6 -0.4 -1.0 -1.7 -0.8  0.0  0.8 -0.9 -1.2  1.3  1.4  0.4 -5.5 -3.5
C -1.2 12.6 -3.7 -4.3 -0.1 -1.7 -1.5 -2.4 -3.3 -2.6 -2.5 -1.9 -3.1 -3.3 -1.6  0.3 -1.1 -1.7  0.5  0.6
D -0.2 -3.7  4.8  3.9 -5.4  0.7  0.3 -4.0  0.2 -4.9 -3.9  2.4 -1.8  0.6 -1.0  0.1 -0.7 -3.0 -6.4 -3.0
E -0.3 -4.3  3.9  4.6 -5.7  0.5 -0.2 -3.6  1.0 -4.4 -3.4  1.2 -1.7  1.7 -0.1 -0.5 -0.9 -2.7 -6.3 -4.0
F -3.1 -0.1 -5.4 -5.7  7.7 -5.8  0.3  0.5 -5.1  2.2  0.7 -3.5 -3.4 -3.6 -4.3 -2.2 -2.6 -0.1  0.5  5.9
G  0.8 -1.7  0.7  0.5 -5.8  6.2 -2.0 -3.8 -1.0 -4.9 -3.8  0.4 -1.8 -1.4 -0.7  0.6 -0.7 -2.5 -4.5 -4.8
H -1.6 -1.5  0.3 -0.2  0.3 -2.0  6.1 -3.2  0.8 -2.1 

In [85]:
matrix = substitution_matrices.load("SCHNEIDER")
print(matrix)

#  Adrian Schneider, Gina M. Cannarozzi, and Gaston H. Gonnet:
#  "Empirical codon substitution matrix."
#  BMC Bioinformatics 6:134 (2005).
#  Additional File 3.
#  PMID 15927081
      AAA   AAC   AAG   AAT   ACA   ACC   ACG   ACT   AGA   AGC   AGG   AGT   ATA   ATC   ATG   ATT   CAA   CAC   CAG   CAT   CCA   CCC   CCG   CCT   CGA   CGC   CGG   CGT   CTA   CTC   CTG   CTT   GAA   GAC   GAG   GAT   GCA   GCC   GCG   GCT   GGA   GGC   GGG   GGT   GTA   GTC   GTG   GTT   TAA   TAC   TAG   TAT   TCA   TCC   TCG   TCT   TGA   TGC   TGG   TGT   TTA   TTC   TTG   TTT
AAA  11.6  -2.7   9.7  -1.7  -2.7  -6.4  -3.9  -5.6   5.1  -5.0   3.6  -4.2  -6.3 -13.0  -7.1 -11.5   0.4  -6.0  -1.9  -5.3  -8.5 -11.2  -8.9 -10.8   2.1   0.0   1.4   0.2 -10.2 -13.5 -13.0 -12.5  -2.6  -8.5  -5.0  -8.1  -6.3  -9.9  -7.5  -9.0  -7.1 -10.2  -8.2  -9.2  -8.2 -12.5 -11.1 -11.4 -50.0 -14.8 -50.0 -13.8  -7.3 -10.1  -8.4  -9.1 -50.0 -13.0 -13.5 -12.4 -10.7 -18.1 -11.8 -17.2
AAC  -2.7  13.0  -3.3  10.9  -3.5  -0.4  -3.

In [86]:
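# Note: the aligner's attribute is called `substitution_matrix`. The printout below
# still shows the default match/mismatch scores, so `matrix=` had no effect here.
aligner = Align.PairwiseAligner(matrix=matrix)
alignments = aligner.align(seq1, seq2)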

In [88]:
print(aligner)

Pairwise sequence aligner with parameters
  match_score: 1.000000
  mismatch_score: 0.000000
  target_internal_open_gap_score: 0.000000
  target_internal_extend_gap_score: 0.000000
  target_left_open_gap_score: 0.000000
  target_left_extend_gap_score: 0.000000
  target_right_open_gap_score: 0.000000
  target_right_extend_gap_score: 0.000000
  query_internal_open_gap_score: 0.000000
  query_internal_extend_gap_score: 0.000000
  query_left_open_gap_score: 0.000000
  query_left_extend_gap_score: 0.000000
  query_right_open_gap_score: 0.000000
  query_right_extend_gap_score: 0.000000
  mode: global



In [87]:
print(len(alignments))

OverflowError: number of optimal alignments is larger than 9223372036854775807
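
With the scores printed above (match 1, mismatch 0, gaps 0), mismatches and gaps cost nothing, so a very large number of different alignments tie for the best score. That would explain the overflow. The sketch below counts tied optimal paths through the dynamic-programming table for toy sequences, in plain Python. It is an illustration only: the function name and sequences are our own, and Biopython's own count may treat equivalent gap orderings differently, so the numbers need not match.

```python
def count_optimal_alignments(a, b, match=1.0, mismatch=0.0, gap=0.0):
    """Return (best score, number of table paths that reach it) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    count[0][0] = 1  # one way to align two empty prefixes
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            candidates = []  # (score, number of paths) for each allowed move
            if i > 0 and j > 0:
                substitution = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append((score[i - 1][j - 1] + substitution, count[i - 1][j - 1]))
            if i > 0:
                candidates.append((score[i - 1][j] + gap, count[i - 1][j]))
            if j > 0:
                candidates.append((score[i][j - 1] + gap, count[i][j - 1]))
            best = max(value for value, _ in candidates)
            score[i][j] = best
            # Exact float comparison is safe here because all scores are multiples of 0.5
            count[i][j] = sum(n for value, n in candidates if value == best)
    return score[-1][-1], count[-1][-1]


toy1, toy2 = "ACGTACGT", "ACGAACGT"  # made-up sequences differing at one position
loose_score, loose_count = count_optimal_alignments(toy1, toy2)
strict_score, strict_count = count_optimal_alignments(toy1, toy2, 1.0, -2.0, -2.5)
print("zero penalties:", loose_score, loose_count)
print("with penalties:", strict_score, strict_count)
```

Penalising mismatches and gaps breaks most of these ties, which is one reason to set explicit scores before asking for the alignments.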

In [89]:
aligner = Align.PairwiseAligner()
aligner.match_score = 1.0
aligner.mismatch_score = -2.0
aligner.gap_score = -2.5
print(aligner)
alignments = aligner.align(seq1, seq2)

Pairwise sequence aligner with parameters
  match_score: 1.000000
  mismatch_score: -2.000000
  target_internal_open_gap_score: -2.500000
  target_internal_extend_gap_score: -2.500000
  target_left_open_gap_score: -2.500000
  target_left_extend_gap_score: -2.500000
  target_right_open_gap_score: -2.500000
  target_right_extend_gap_score: -2.500000
  query_internal_open_gap_score: -2.500000
  query_internal_extend_gap_score: -2.500000
  query_left_open_gap_score: -2.500000
  query_left_extend_gap_score: -2.500000
  query_right_open_gap_score: -2.500000
  query_right_extend_gap_score: -2.500000
  mode: global



In [95]:
print(alignments[0].query[:75])

ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAAC
