# **Biopython**

If you join a lab or a biotech company, you may well have to deal with many biological sequences. However, using the string type in Python to represent sequences would be quite cumbersome, because it does not provide tools specific to biological sequence analysis. 

There are many biologists who have experienced the same problem and they have shared their own modules and packages. Eventually, people began to collect the most commonly used tools into a coherent package, Biopython. It provides tools useful for many biological analyses, not just sequence analysis, but also 3D protein structure analysis and statistical analysis specific to biology.

In this and the next lectures, we enter the territory of real scientific programming. As a showcase, I will show you how to use Biopython to manipulate biological sequences. You will see a number of classes and functions in these lectures. Note that **you do not have to memorize them**. Unlike basic Python data structures such as lists or dictionaries, which are used across the entire Python community, these classes are specific to biological sequence analysis. Even seasoned scientists memorize only the functions they use most often. They constantly search the web for useful functions and copy and paste code snippets. Therefore, **it is important to know which functions exist, or at least where you can find them**. Unless you are solving a completely new problem, chances are that a solution already exists on the internet. Finally, it is very important to **test every piece of code you copy and paste before using it** in any serious program. A small syntax error introduced during copying can produce errors that are hard to trace. Thus, whenever you use other people's code, test it in a short program first and deploy it only after you have confirmed that it works.

---

We will begin this lecture with standard file formats of sequence data.

---


## **FASTA format**

A significant part of doing biology involves dealing with files holding biological data, such as the genome, protein sequence, and so on. A FASTA file contains a sequence(s) of the genome (nucleotides) or proteins (amino acids). The first line beginning with ">" is a description line. If you downloaded a FASTA file from the NCBI genome database, the first alphanumeric word is a unique ID for the sequence. It is followed by other information such as the name of species, chromosome number, etc.

There can be multiple sequences in a single FASTA file. In this case, there are multiple description lines beginning with ">".

See [Wikipedia](https://en.wikipedia.org/wiki/FASTA_format) for more details.

You can search for and download a FASTA file of a species from [NCBI](https://www.ncbi.nlm.nih.gov/). In the search box, select "Genome" and enter a keyword, for example, E. coli. Near the top of the result page, you can download a FASTA file of the E. coli genome.

The description line of the FASTA file needs to be "parsed", to extract meta information. You can write your own parsing script. But there are many tools out there for R, Ruby, Python, Perl, and so on. We will use the [Biopython](http://biopython.org/) package.
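
To make the format concrete, here is a minimal sketch of what a hand-written parser could look like. The function name `parse_fasta_text` and the two example records are made up for illustration, and the sketch ignores many details of real files; for actual work, use Biopython as shown in the rest of this lecture.

```python
def parse_fasta_text(text):
    """Return a list of (seq_id, description, sequence) tuples from FASTA text."""
    records = []
    header = None      # description line of the record being read (without '>')
    chunks = []        # sequence lines collected for that record

    def finish():
        # The first word of the description line is treated as the ID.
        words = header.split()
        seq_id = words[0] if words else ""
        records.append((seq_id, header, "".join(chunks)))

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            if header is not None:
                finish()             # close the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)      # sequences may span several lines
    if header is not None:
        finish()                     # close the last record
    return records


# Hypothetical two-record FASTA text
example_fasta = """>seq1 Example organism chromosome 1
ATGCGT
ACGT
>seq2 Example organism plasmid
GGGCCC
"""

parsed = parse_fasta_text(example_fasta)
for seq_id, description, sequence in parsed:
    print(seq_id, "|", description, "|", sequence)
```

Keeping the whole description line next to the ID mirrors what Biopython does with its `id` and `description` fields.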


## **GenBank format**

The GenBank format is especially useful because NCBI provides extensive annotation in it.

See [Wikipedia](https://en.wikipedia.org/wiki/GenBank) for more details.

Most functions introduced in this class for FASTA files can be used with GenBank files in exactly the same way, except that the data stored by Biopython can differ.

### **Dealing with FASTA and GenBank files**

More details follow.

- `SeqIO` module: Parses biological data files in various formats.

- `SeqIO.parse()`: The main function to parse files. Like the `open` function, it takes a file name (or a handle). It returns an iterator that yields one `SeqRecord` object per sequence in the file.

- `SeqRecord` object: Holds a `Seq` object together with the other metadata in the data file.

In [None]:
from Bio import SeqIO

for seq_record in SeqIO.parse('GCF_000006745.1_ASM674v1_genomic.fna','fasta'): # Use 'genbank' option for genbank files
    print(seq_record)
    print('--------------')
    print(seq_record.seq[:100])
    print(len(seq_record))
    print('----------------------------')

print('Type of seq_record:', type(seq_record))
print('Type of seq:', type(seq_record.seq))

    
#dir(SeqIO)         # uncomment it to see all methods
#dir(seq_record)    # uncomment it to see all member variables
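
If you do not have the genome file at hand, the same loop can be tried on a small in-memory example. This is only a sketch: the two records below are invented, and `StringIO` simply makes a string behave like an open file so that `SeqIO.parse()` can read it.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical FASTA content; not a real database entry
fasta_text = """>demo1 hypothetical record one
ATGCGTACGT
>demo2 hypothetical record two
GGGCCCAAATTT
"""

# SeqIO.parse() yields one SeqRecord per '>' line
demo_records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for rec in demo_records:
    # id is the first word of the description line; description is the whole line
    print(rec.id, "|", rec.description, "|", len(rec))
```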

---

# **Biopython**

Biopython provides a huge set of classes and modules useful for biological sequence analyses. If you ever need to do sequence analysis, this package will help you in many ways.  To fully take advantage of this package, you will need to study the [Documentation](https://biopython.org/docs/1.78/api/Bio.html) and [Tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html). Covering the entire package in a couple of classes is impossible. In this class, we will focus on essential classes that help to manipulate biological sequences.

<sub><sub>Disclaimer: This notebook contains part of the official tutorial, especially Chapters 3, 4, and 5.</sub></sub>

In [None]:
# Version check
import Bio
print(Bio.__version__)

# Getting help for Biopython functions and classes
#from Bio import SeqIO
#help(SeqIO)

## **`Seq` class**

The `Seq` class is the most basic class in Biopython. It behaves much like a string and supports most string methods. Additional methods useful for sequence manipulation are also provided.

- `Seq.complement()`: Generates a complementary sequence of DNA
- `Seq.reverse_complement()`: Same as complement() but in the reverse order
- `Seq.count(pattern)`: Returns a non-overlapping count (as for string)
- `Seq.count_overlap(pattern)`: Returns an overlapping count
- `Seq.join()`: works like string.join()
- `Seq.lower()`: returns a lower case sequence
- `Seq.upper()`: returns an upper case sequence

Run `help(Seq)` for the list of methods available for the Seq object.

In [None]:
from Bio.Seq import Seq
#help(Seq)                # Uncomment this line to see the full list of supported methods

ex_seq = Seq('AGTACACTGGT')
ex_seq

In [None]:
print(ex_seq)

In [None]:
print(ex_seq.complement())
print(ex_seq.reverse_complement())
print(ex_seq)     # The original sequence does not change

In [None]:
print(  "AAAA".count("AA")  )       # count function of string
print(  Seq("AAAA").count("AA")  )  # count function of Seq

In [None]:
print(Seq("AAAA").count_overlap("AA"))
print(Seq("ATATATATA").count_overlap("ATA"))
print(Seq("ATATATATA").count_overlap("ATA", 3, -1))    # index for start (3) and end (-1)
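
To see why the two counts differ, here is a plain-Python sketch of overlapping counting. The helper name `count_overlapping` is hypothetical and this is not how Biopython implements `count_overlap()`; it only illustrates the idea of restarting the search one position after each hit.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Move one position forward so that overlapping hits are also found
        start = text.find(pattern, start + 1)
    return count


for text, pattern in [("AAAA", "AA"), ("ATATATATA", "ATA")]:
    print(text, pattern,
          "non-overlapping:", text.count(pattern),
          "overlapping:", count_overlapping(text, pattern))
```

The built-in `str.count()` jumps past each match, which is why it misses the hits that share letters with the previous one.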

In [None]:
# slicing of Seq is identical to string

print(ex_seq[2::2])
print(ex_seq[:3])
print(ex_seq[::-1])  # reversing the string

In [None]:
# You may change Seq to string

s = str(ex_seq)
s

In [None]:
# Concatenation is similar to string concatenation
ex_seq1 = Seq('AGTACACTGGT')
ex_seq2 = Seq('TCGAACTTGAT')

c = ex_seq1 + ex_seq2
print(c)

# Or you can use join method
print(Seq("N"*10).join([ex_seq1, ex_seq2]))

In [None]:
# Changing case
print(ex_seq.lower())
print(ex_seq)
ex_seq3 = Seq('atgcttggac')
print(ex_seq3.upper())

# Useful for case-insensitive matching
print("CTT" in ex_seq3)
print("CTT" in ex_seq3.upper())


### **Transcription**

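In the cell, RNA polymerase reads the template strand, so the resulting mRNA is the reverse complement of the template strand; it matches the coding strand with T replaced by U. In Biopython, however, `transcribe()` only changes T to U. Thus, you should use the coding strand to get a proper mRNA sequence.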

- `Seq.transcribe()`: Generates an mRNA sequence from coding strand of DNA
- `Seq.back_transcribe()`: Generates coding strand of DNA from mRNA sequence


In [None]:
template_strand = Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT')
coding_strand = template_strand.reverse_complement()
mRNA = coding_strand.transcribe()
print(mRNA)

# One line operation
mRNA2 = template_strand.reverse_complement().transcribe()
print(mRNA2)

# Back transcription
coding_dna = mRNA.back_transcribe()
print(coding_dna)
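
The two steps can also be written in plain Python, which may help to see what happens to each letter. This is a sketch only: the helper name `template_to_mrna` is hypothetical, and it handles only upper-case A, C, G, and T.

```python
# Translation table that swaps each base with its complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def template_to_mrna(template_dna):
    """Return the mRNA (5'->3') for a template strand given 5'->3'."""
    # Step 1: reverse complement of the template gives the coding strand
    coding = template_dna.translate(COMPLEMENT)[::-1]
    # Step 2: "transcription" in the Biopython sense, T -> U
    return coding.replace("T", "U")


plain_mrna = template_to_mrna("CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT")
print(plain_mrna)
```

The result should be the same string that `template_strand.reverse_complement().transcribe()` prints above.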

### **Translation**
- `Seq.translate()`: Generates a sequence of amino acids

Note that you can use a different genetic code table. The default is the *standard* code, but mitochondrial sequences may use a different table. See [NCBI](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi) for more details.

In [None]:
from Bio.Seq import Seq
mRNA = Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')
print(mRNA.translate())
# '*' means stop codon. Biopython does not stop translation at the stop codon, so that
# downstream analyses have more information. To stop at the first stop codon, use to_stop=True
print(mRNA.translate(to_stop = True))

In [None]:

# using a different genetic code
mRNA.translate(table="Vertebrate Mitochondrial")
# Note that the stop codon in the middle disappeared.
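
The genetic code can be selected either by name or by its NCBI table number. As a small sketch, the cell below translates the same mRNA with the standard code and with table 2, which is the vertebrate mitochondrial code on the NCBI page linked above. The variable names are only for this example.

```python
from Bio.Seq import Seq

demo_mrna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")

standard_protein = str(demo_mrna.translate())                          # default: standard code
mito_by_name = str(demo_mrna.translate(table="Vertebrate Mitochondrial"))
mito_by_id = str(demo_mrna.translate(table=2))                         # same table, by NCBI number

print(standard_protein)
print(mito_by_id)
print("Same result by name and by number:", mito_by_name == mito_by_id)
```

In the mitochondrial code UGA encodes tryptophan (W), which is why the internal stop disappears.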

In [None]:
# Bacterial gene
gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCAGCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGATAATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACATTATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCATAAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")

print(gene.translate(table="Bacterial"))
# In bacterial genomes, GTG is a valid start codon and should be translated as M (methionine).
# To that end, you should tell Biopython that the sequence is a complete sequence beginning
# with a start codon and ending with a stop codon.
print(gene.translate(table="Bacterial", cds=True))
# Note that the first letter has been changed from V to M. Also, it omits '*'.
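
The `cds=True` option also checks that the sequence really is a complete coding sequence, and it raises an error otherwise. Below is a small sketch of how that check could be used; the helper name `translate_cds_or_none` and the two short sequences are invented for illustration.

```python
from Bio.Seq import Seq
from Bio.Data.CodonTable import TranslationError


def translate_cds_or_none(dna_text, table="Bacterial"):
    """Translate a complete CDS, or return None if it is not a valid CDS."""
    try:
        return str(Seq(dna_text).translate(table=table, cds=True))
    except TranslationError as err:
        print("Not a complete CDS:", err)
        return None


good_cds = translate_cds_or_none("GTGAAATAA")   # GTG start, AAA, TAA stop
bad_cds = translate_cds_or_none("GTGAAAGGG")    # no stop codon at the end
print(good_cds, bad_cds)
```

This kind of check is handy before trusting a translation of sequences copied from elsewhere.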

In [None]:
# Genetic code can be examined
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
bac_table = CodonTable.unambiguous_dna_by_name["Bacterial"]

print(standard_table)
print('============================================================')
print(mito_table)
print('============================================================')
print(bac_table)

In [None]:
# codon table gives you a few variables with useful information
dir(mito_table)
# See "back_table", "forward_table", etc.

In [None]:
# This was not covered in the lecture, but it is quite useful.
mito_table.stop_codons

In [None]:
mito_table.forward_table
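
To illustrate what `forward_table` and `stop_codons` contain, here is a sketch that translates a short DNA sequence by looking up one codon at a time. The helper name `translate_by_lookup` is hypothetical, and the sketch assumes upper-case, unambiguous DNA; `Seq.translate()` remains the tool to use in practice.

```python
from Bio.Data import CodonTable
from Bio.Seq import Seq

demo_table = CodonTable.unambiguous_dna_by_name["Standard"]


def translate_by_lookup(dna_text, table):
    """Translate codon by codon using the table's dictionary and stop list."""
    protein = []
    usable_length = len(dna_text) - len(dna_text) % 3   # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna_text[i:i + 3]
        if codon in table.stop_codons:
            protein.append("*")                          # stop codons are not in forward_table
        else:
            protein.append(table.forward_table[codon])
    return "".join(protein)


demo_dna = "ATGGCCATTGTAATGGGCCGCTGA"
lookup_protein = translate_by_lookup(demo_dna, demo_table)
print(lookup_protein)
print(Seq(demo_dna).translate())   # should print the same protein
```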

In [None]:
help(Seq)

---

## **`MutableSeq` class**

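`Seq` is not mutable. Use `MutableSeq` to change any letter in the sequence. `MutableSeq` operates very differently from `Seq`: item assignment and in-place methods **directly change the sequence of the object**. In older Biopython releases, methods such as `reverse_complement()` modified a `MutableSeq` in place by default. Recent releases return a new object unless you pass `inplace=True`, and the old `tomutable()` and `toseq()` helpers are deprecated in favor of the `MutableSeq(...)` and `Seq(...)` constructors.



In [None]:
from Bio.Seq import Seq, MutableSeq
mu_seq = MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')
print(mu_seq)

# Or you can convert Seq to MutableSeq (going through str works in old and new releases)
mu_ex_seq = MutableSeq(str(ex_seq))


mu_seq[5] = 'C'  # directly modifies the sequence
print(mu_seq)
# In-place reverse complement. Biopython 1.78 and earlier do not accept the 'inplace'
# argument; there, plain mu_seq.reverse_complement() already works in place.
mu_seq.reverse_complement(inplace=True)
print(mu_seq)


ex_seq_from_mu = Seq(str(mu_seq))  # converts to an immutable Seq object
print(ex_seq_from_mu)
try:
    ex_seq_from_mu[5] = 'T'  # Seq does not support item assignment
except TypeError as err:
    print('Error:', err)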

---

## **`SeqRecord` class**

This is a simple class that wraps the `Seq` class and stores metadata. It has a few member variables.

- `SeqRecord.seq`: A sequence. Typically `Seq` object
- `SeqRecord.id`: Typically, a unique ID of the sequence (such as from the FASTA file)
- `SeqRecord.name`: Name of the sequence
- `SeqRecord.description`: A short description. If read from a FASTA file, it is taken from the record's description line (the line beginning with ">").
- `SeqRecord.annotations`: A dictionary to store additional information about the sequence.
- `SeqRecord.letter_annotations`: A dictionary to store information of every letter in the sequence, typically numbers, to represent the quality of each letter in the sequence.
- `SeqRecord.format()`: Generates a string in a specific format

See documentation for more information. (Or you can use dir() to see the list.)

You can use this class to store any sequence that you want to tag, comment on, or annotate. You can then save the record into your own FASTA file using `SeqIO` to share the information about your sequence, as shown below.

In [None]:
from Bio.SeqRecord import SeqRecord

my_seq_rec = SeqRecord(Seq('ATGCTGTA'))
my_seq_rec.id = 'my_unique_id_389101932'
my_seq_rec.name = 'Pythonrandseq'
my_seq_rec.description = 'An awesome random sequence'
print(my_seq_rec)

In [None]:
# You can also pass the metadata directly to the constructor

my_seq_rec = SeqRecord(Seq('ATGCTGTA'),
                      id = 'my_unique_id_389101932',
                      name = 'Pythonrandseq',
                      description = 'An awesome random sequence')

print(my_seq_rec)

In [None]:
# You can use "annotations" to store additional information. It is a dictionary. It is up to you to make any useful keywords
my_seq_rec.annotations['my_note']='experiment on 1/18/21: sequencing was successful'
my_seq_rec.annotations['weather']='was sunny in Santa Barbara'
my_seq_rec.annotations

# However, if you import other gene files such as GenBank files, it may already have
# some annotations filled in. NCBI has a strict rule for the format and type of data.

In [None]:
# Unlike annotations, "letter_annotations" is for every single letter in the sequence.
# Therefore, the length of the value should match the number of letters in the sequence.

my_seq_rec.letter_annotations["phred_quality"] = [40, 40, 38, 30,42, 37, 32, 44]
print(my_seq_rec.letter_annotations["phred_quality"])
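
Because the per-letter values are tied to the letters, slicing a `SeqRecord` slices them as well, and Biopython rejects a value of the wrong length. The sketch below uses an invented record (`demo_read`) to show both behaviors.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical read with one quality score per letter
demo_rec = SeqRecord(Seq("ATGCTGTA"), id="demo_read", description="hypothetical read")
demo_rec.letter_annotations["phred_quality"] = [40, 40, 38, 30, 42, 37, 32, 44]

# Slicing the record slices the sequence and the per-letter annotation together
sub_rec = demo_rec[2:5]
print(sub_rec.seq)
print(sub_rec.letter_annotations["phred_quality"])

# A list of the wrong length is rejected, and the stored values stay unchanged
try:
    demo_rec.letter_annotations["phred_quality"] = [40, 40]
    rejected = False
except (TypeError, ValueError) as err:
    rejected = True
    print("Rejected:", err)
```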

In [None]:
# Use 'format()' function to generate a string in a specific format. This can be used
# when you save the sequence data into a file.
my_seq_rec.format('fasta')

In [None]:
print(my_seq_rec.format('fasta'))

---

## **`SeqIO` module**

Use `SeqIO` to read and write files in standard formats, such as FASTA and GenBank. Use help to get more information.
```
from Bio import SeqIO
help(SeqIO)
```

`SeqIO` uses `SeqRecord` to store or extract relevant information during I/O.

- `SeqIO.read()` : parses standard gene files. Useful for a single record. See a section below *Parsing directly from NCBI*
- `SeqIO.parse()`: parses standard gene files. This is for many records. It returns an iterator that you can use in the for-loop.
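
The difference between the two functions shows up as soon as a file holds more than one record. The following sketch uses invented in-memory FASTA text instead of real files; `SeqIO.read()` accepts exactly one record and raises a `ValueError` otherwise, while `SeqIO.parse()` handles any number.

```python
from io import StringIO
from Bio import SeqIO

# Hypothetical FASTA text with one and with two records
one_record = ">only_one hypothetical\nATGC\n"
two_records = ">first hypothetical\nATGC\n>second hypothetical\nGGCC\n"

# read() returns a single SeqRecord
single = SeqIO.read(StringIO(one_record), "fasta")
print(single.id)

# read() refuses input with more than one record
try:
    SeqIO.read(StringIO(two_records), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("SeqIO.read refused:", err)

# parse() iterates over all of them
parsed_ids = [rec.id for rec in SeqIO.parse(StringIO(two_records), "fasta")]
print(parsed_ids)
```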


In [None]:
for seq_record in SeqIO.parse('GCF_000006745.1_ASM674v1_genomic.fna', 'fasta'):
    print(seq_record.id)
    print(seq_record.description)
    print('Length:', len(seq_record))
    print('---------------------------------')


In [None]:
# You can use comprehension to collect specific information

[ seq_record.id for seq_record in SeqIO.parse('GCF_000006745.1_ASM674v1_genomic.fna', 'fasta')]

In [None]:
# To store the entire records in the file

from Bio import SeqIO

records = list(SeqIO.parse('GCF_000006745.1_ASM674v1_genomic.fna', 'fasta')) # makes a list

print("Found", len(records), "records")

print("The last record")
last_record = records[-1]  # using Python's list tricks
print(last_record.id)
print(last_record.description)

print("The first record")
first_record = records[0]  # remember, Python counts from zero
print(first_record.id)
print(first_record.description)


In [None]:
# For this particular species (Vibrio cholerae), the name is in the 2nd and 3rd words
# So the name can be accessible with the following method

tmp = records[0].description.split()[1:3]  # splits the description and picks the 2nd and 3rd words
records[0].name = ' '.join(tmp)  # joins the two words with a space
print(records[0].name)

### **Writing sequence into a file**

- `SeqIO.write(records, filename, format)`: Saves sequence records into a file in the given format
- `SeqIO.convert(file1, format1, file2, format2)`: Converts file1 (in format1) into file2 (in format2). Note: Using `parse()` then `write()` has some complicated issues. Use `convert()` to convert the file format. Further, some conversions are impossible, because different formats contain different information.

In [None]:
from Bio import SeqIO
records = list(SeqIO.parse('GCF_000006745.1_ASM674v1_genomic.fna', 'fasta')) # makes a list

# Write
SeqIO.write(records, 'test_save_file.fna', 'fasta')

# Read it
new_records = list(SeqIO.parse('test_save_file.fna', 'fasta'))
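
The same write-then-read round trip can be tried without touching the disk. This is a sketch with two invented records; `StringIO` stands in for the output file, and `SeqIO.write()` returns the number of records it wrote, which is a convenient sanity check.

```python
from io import StringIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical records to save
out_records = [
    SeqRecord(Seq("ATGCTGTA"), id="demo_a", description="hypothetical record A"),
    SeqRecord(Seq("GGCCTTAA"), id="demo_b", description="hypothetical record B"),
]

buffer = StringIO()                              # behaves like a file opened for writing
n_written = SeqIO.write(out_records, buffer, "fasta")
print("Records written:", n_written)
print(buffer.getvalue())

buffer.seek(0)                                   # rewind before reading it back
round_trip = list(SeqIO.parse(buffer, "fasta"))
print([rec.id for rec in round_trip])
```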

In [None]:
# Assumes NM_001316525.1.gbk is already on disk (retrieve_GenBank_info() below creates it)
SeqIO.convert("NM_001316525.1.gbk", "genbank", "test_conversion.fna", "fasta")

### **Parsing directly from NCBI**

See the example below to see the basic usage.

In [None]:
from Bio import Entrez
Entrez.email = "your_email_address"

In [None]:
# FASTA format
handle = Entrez.efetch(db="nucleotide", id="NM_206006.2", rettype="fasta", retmode="text")
#print(handle.read())  # This prints all data. Too long!

# To parse directly,
fasta_seq_record = SeqIO.read(handle, 'fasta')
print(fasta_seq_record)

In [None]:
# GenBank format gives you more annotation
handle = Entrez.efetch(db="nucleotide", id="NM_206006.2", rettype="gbwithparts", retmode="text")
### IMPORTANT: Not mentioned in the lecture. If the GenBank record is too long, only
###            the summary of the record is returned.  If your GenBank record is long and
###            you still want a full record, use rettype="gbwithparts" as shown here,
###            instead of "gb". It will return the full GenBank record regardless of
###            the length.

#print(handle.read())


# Reading GenBank format from NCBI does not always return proper sequence data.
# With rettype="gb", the sequence data may be missing for long records (see the note above).
# Therefore, sometimes, you may want to download both FASTA and GenBank data and then combine them.
gb_seq_record = SeqIO.read(handle, 'genbank')
print(gb_seq_record)

We can combine these routines into a neat function that first saves the information into a file and then reads it back in.

In [None]:
import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "your_email_address"  # Always tell NCBI who you are

In [None]:
# This function retrieves GenBank record of the id.

def retrieve_GenBank_info(id):
    filename = id + ".gbk"

    print("Downloading GenBank information of", id, "...")
    ### IMPORTANT: Unlike the lecture, here we use rettype="gbwithparts" to save the full
    ###            GenBank record.
    net_handle = Entrez.efetch(db="nucleotide", id=id, rettype="gbwithparts", retmode="text")

    # The with-statement closes the output file even if writing fails
    with open(filename, "w") as out_handle:
        out_handle.write(net_handle.read())
    net_handle.close()
    print("Saved")

    print("Parsing...")
    # Here, we use "read()" instead of "parse()", because we know this is a single record.
    # If you modify this function so that 'id' is a list of genes, you may want to use
    # parse() instead and put it in the for-loop as we did above in the first section of "SeqIO".
    record = SeqIO.read(filename, "genbank")
    return record

In [None]:
# Here are a couple of examples.

ids=[]
ids.append("NM_001316525.1") # ID of D. melanogaster Dop1R1 variant F (dopamine receptor)
ids.append("NM_206006.2")  # D.melanogaster Brain-specific homeobox (bsh)

for id in ids: 
    record = retrieve_GenBank_info(id)
    print('----------------------------------------')
    print('ID:', record.id)
    print('   ', record.description)
    print('----------------------------------------')
    print(record.seq[:500])
    print('-------------------------------------------------------------------------')
    

---

# **Summary**

It was a long lecture. But the bottom line is simple: (1) We now know how to retrieve biological sequences from the world's largest database (copy & paste any code you think useful for your work), (2) We also know that retrieved sequences and the metadata can be accessed using the `Seq` class and the `SeqRecord` class. We know that there are useful functions in these classes to easily manipulate the information (you don't have to remember them, but it is good to know what kind of functions are available and where to find how to use them).

So, what do we do with this?  Well, you are now in a position to begin your own sequence analysis. From here on, everything depends on the scientific questions you have.
