# BIOPYTHON

Biopython is a Python package for bioinformatics tasks such as sequence manipulation, plotting, population genetics, cluster analysis, and genome analysis. It also provides parsers for common file formats (FASTA, GenBank, PDB, etc.) and tools for retrieving data from online databases such as GenBank and the PDB.

Let's check the version!

In [1]:
import Bio
print(Bio.__version__)

1.83


# 1. Basic classes

- Some basic building blocks in Biopython are the `Seq`, `SeqRecord`, and `SeqFeature` classes and the `SeqIO` module.

## 1.1 `Seq`

The Seq class in Biopython is particularly useful for representing and manipulating biological sequences, such as DNA, RNA, or protein sequences. It allows you to perform various operations on these sequences, such as reverse complementation, transcription, translation, and more.

In [2]:
from Bio.Seq import Seq 

seq='MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQN'
seq=Seq(seq) 

print(seq)

MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQN


In [3]:
seq.find('A') 

1

In [4]:
seq.find('QIA') 

8

In [5]:
# Create a DNA sequence
dna_sequence = Seq("ATCG")

# Get the reverse complement
reverse_complement = dna_sequence.reverse_complement()

print("Original DNA sequence:", dna_sequence)
print("Reverse complement:", reverse_complement)

Original DNA sequence: ATCG
Reverse complement: CGAT


The `reverse_complement()` method of the `Seq` class returns the reverse complement of a nucleotide sequence. The term combines two operations:

- Reverse: the order of the elements in the sequence is reversed. For example, the reverse of "ATCG" is "GCTA".
- Complement: each nucleotide is replaced by the nucleotide it base-pairs with. Adenine (A) pairs with thymine (T), and cytosine (C) pairs with guanine (G), so the complement of "A" is "T" and the complement of "C" is "G".

`reverse_complement()` applies both operations: it reverses the sequence and replaces each nucleotide with its complement, so "ATCG" becomes "CGAT".
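
The two steps can be spelled out in plain Python. This is only a minimal sketch for the four unambiguous bases, not how Biopython implements the method; the names `COMPLEMENT` and `reverse_complement` are chosen here for illustration.

```python
# Complement map for the four unambiguous DNA bases (illustrative only)
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Reverse the sequence, then complement each base."""
    reversed_dna = dna[::-1]  # step 1: reverse the order
    # step 2: replace every base with its pairing partner
    return "".join(COMPLEMENT[base] for base in reversed_dna)


print(reverse_complement("ATCG"))  # CGAT, matching the Biopython output above
```

Applying the function twice returns the original sequence, which is a quick sanity check for any reverse-complement routine.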

The reverse complement is important in molecular biology and bioinformatics for several reasons:

- Base pairing in DNA: adenine (A) pairs with thymine (T), and cytosine (C) pairs with guanine (G). Because the two strands of a double helix run in opposite directions, one strand read 5' to 3' is the reverse complement of the other. This matters for tasks such as primer design for PCR (polymerase chain reaction) or sequencing experiments, where researchers use the reverse complement to target the correct strand.
- Sequence analysis: a region of interest may lie on either strand, so comparisons and alignments commonly consider both a sequence and its reverse complement when looking for regions of similarity or homology, conserved regions, or mutations.
- Transcription and translation: DNA is transcribed into RNA, and RNA is translated into protein. During transcription, RNA is synthesized against the DNA template strand, so the RNA is the reverse complement of the template strand and matches the coding strand with uracil (U) in place of thymine (T). Knowing which strand you are working with is essential for predicting the transcribed RNA sequence accurately.
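
The relationship between the coding strand, the template strand, and the mRNA can be checked with `Seq` methods. The sketch below uses a short made-up coding sequence (an assumption for illustration) together with the `transcribe()` and `translate()` methods.

```python
from Bio.Seq import Seq

# Hypothetical coding-strand DNA, written 5' to 3'
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# The template strand is the reverse complement of the coding strand
template_dna = coding_dna.reverse_complement()

# transcribe() works from the coding strand: it replaces T with U
mrna = coding_dna.transcribe()

# Starting from the template strand gives the same mRNA
mrna_from_template = template_dna.reverse_complement().transcribe()

# Translate the mRNA, stopping at the first in-frame stop codon
protein = mrna.translate(to_stop=True)

print("mRNA:   ", mrna)
print("Protein:", protein)
print("Same mRNA from template strand:", mrna == mrna_from_template)
```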

## 1.2 `SeqRecord`

The SeqRecord class is part of Biopython and is used to represent a biological sequence along with associated metadata. A biological sequence could be a DNA sequence, RNA sequence, protein sequence, or any other type of biological molecule.

In [6]:
from Bio.SeqRecord import SeqRecord

seq = SeqRecord(seq, id="example_id", description="Example Sequence")
print(seq)

ID: example_id
Name: <unknown name>
Description: Example Sequence
Number of features: 0
Seq('MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQN')
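
As a further sketch, a `SeqRecord` can carry extra metadata, be sliced like a sequence, and be rendered in a file format with `format()`. The `name` value and the annotation below are assumptions added for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(
    Seq("MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQN"),
    id="example_id",
    name="example_name",  # hypothetical name, replaces "<unknown name>"
    description="Example Sequence",
)

# Free-form metadata lives in the annotations dictionary
record.annotations["molecule_type"] = "protein"

# Slicing a record returns a new SeqRecord that keeps the id and description
sub_record = record[:10]
print(sub_record.id, sub_record.seq)

# Render the record as FASTA text
print(record.format("fasta"))
```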


## 1.3 `SeqFeature`

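A `SeqFeature` describes a region of a sequence, such as a gene. Its `FeatureLocation` specifies the start and end positions of the feature using Python-style indexing (zero-based start, exclusive end).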

In [7]:
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = SeqFeature(FeatureLocation(start=1, end=5), type="gene")
print(seq)

type: gene
location: [1:5]
qualifiers:
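
A feature becomes more useful once it is applied to a parent sequence. The sketch below uses `SeqFeature.extract()` to pull out the region a feature describes; the parent sequence, the coordinates, and the `CDS` type are made up for illustration. A location with `strand=-1` is read from the opposite strand, so the extracted region is reverse-complemented.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation

# Hypothetical parent sequence
parent = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# Feature on the forward strand covering positions 3..8 (end is exclusive)
forward = SeqFeature(FeatureLocation(start=3, end=9), type="CDS")
print(forward.extract(parent))  # same as parent[3:9]

# Same coordinates on the reverse strand
reverse = SeqFeature(FeatureLocation(start=3, end=9, strand=-1), type="CDS")
print(reverse.extract(parent))  # reverse complement of parent[3:9]
```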



## 1.4 `SeqIO`

### Working with FASTA file

To learn about `SeqIO`, let's create a FASTA file with a single record.

In [8]:
# Sequence from wikipedia: https://en.wikipedia.org/wiki/FASTA_format

seq_test = '''
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
'''

# Open the file in write mode ('w') and write the content
with open('calmodulin.fasta', 'w') as file:
    file.write(seq_test)

Now let's use `SeqIO` to read the content of the file.

In [9]:
from Bio.SeqIO import parse
# or use `from Bio import SeqIO` and call `SeqIO.parse`

# Open the file in a context manager so it is closed after parsing
with open("calmodulin.fasta") as file:
    records = parse(file, "fasta")

    for record in records:  # we have only one record in the file
        print("Id: %s" % record.id)
        print("Name: %s" % record.name)
        print("Description: %s" % record.description)
        print("Sequence Data: %s" % record.seq)

Id: MCHU
Name: MCHU
Description: MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
Sequence Data: MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK*
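
`SeqIO` also works with any file-like handle and can write records back out. The following sketch parses two made-up records from an in-memory string (the record names and sequences are assumptions) and writes them to another in-memory handle with `SeqIO.write`, which returns the number of records written.

```python
from io import StringIO

from Bio import SeqIO

# Two hypothetical FASTA records held in memory instead of a file
fasta_text = """>seq1 first example record
ATGGCCATTGTAATG
>seq2 second example record
GGGCCGCTGAAAGGG
"""

# parse() returns an iterator; list() collects all records
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
ids = [record.id for record in records]
lengths = {record.id: len(record.seq) for record in records}
print(ids, lengths)

# Write the records back out in FASTA format
output = StringIO()
n_written = SeqIO.write(records, output, "fasta")
print(f"{n_written} records written")
```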


# 2. `Bio.Data`

In Biopython, `Bio.Data` is a module that provides access to various data resources and dictionaries used in bioinformatics analyses. 

## 2.1 `IUPACData`

In Biopython, `IUPACData.ambiguous_dna_complement` is a dictionary that maps each IUPAC nucleotide code, including the ambiguity codes, to its complement. The International Union of Pure and Applied Chemistry (IUPAC) has defined a set of codes to represent ambiguity in DNA sequences, where a particular position could be one of several nucleotides. The complement of an ambiguous DNA sequence is obtained by replacing each code with its complement.

For example, the IUPAC ambiguity codes for DNA include:

    R: A or G
    Y: C or T
    M: A or C
    K: G or T
    S: C or G
    W: A or T
    H: A, C, or T
    B: C, G, or T
    V: A, C, or G
    D: A, G, or T
    N: Any nucleotide (A, C, G, or T)

In [10]:
from Bio.Data import IUPACData 
import pprint 

# Use pprint.pprint to print the complement dictionary in a more organized and readable format
pprint.pprint(IUPACData.ambiguous_dna_complement)

{'A': 'T',
 'B': 'V',
 'C': 'G',
 'D': 'H',
 'G': 'C',
 'H': 'D',
 'K': 'M',
 'M': 'K',
 'N': 'N',
 'R': 'Y',
 'S': 'S',
 'T': 'A',
 'V': 'B',
 'W': 'W',
 'X': 'X',
 'Y': 'R'}


In [11]:
from Bio.Data import IUPACData

complement_of_R = IUPACData.ambiguous_dna_complement["R"]
print(f"Complement of 'R' is: {complement_of_R}")

complement_of_A = IUPACData.ambiguous_dna_complement["A"]
print(f"Complement of 'A' is: {complement_of_A}")

Complement of 'R' is: Y
Complement of 'A' is: T
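
The complement of an ambiguity code follows from the bases it stands for: complement every possible base, then look up the code for the resulting set. The plain-Python sketch below rebuilds the mapping from the ambiguity codes listed above. The dictionary and function names are chosen here for illustration and are not part of Biopython.

```python
# Bases represented by each IUPAC DNA code (from the list above)
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Reverse lookup: set of bases -> IUPAC code
CODE_FOR_BASES = {frozenset(bases): code for code, bases in IUPAC_DNA.items()}


def complement_code(code: str) -> str:
    """Complement every base a code can stand for, then find the matching code."""
    complemented = frozenset(BASE_COMPLEMENT[base] for base in IUPAC_DNA[code])
    return CODE_FOR_BASES[complemented]


derived = {code: complement_code(code) for code in sorted(IUPAC_DNA)}
print(derived["R"], derived["H"], derived["S"])  # Y D S
```

For example, R stands for A or G, whose complements are T or C, which is the code Y.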


## 2.2 Codon table

A codon table can be used to translate a genetic code into a sequence of amino acids. The standard genetic code is traditionally represented as an RNA codon table, because when proteins are made in a cell by ribosomes, it is messenger RNA (mRNA) that directs protein synthesis. The mRNA sequence is determined by the sequence of genomic DNA. In this context, the standard genetic code is referred to as translation table 1. It can also be represented as a DNA codon table.

From [Wikipedia](https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables)

Let's print the DNA codon table using `Bio.Data` 

In [12]:
from Bio.Data import CodonTable 
table = CodonTable.unambiguous_dna_by_name["Standard"] 
print(table) 

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

`TAA` (ochre), `TAG` (amber), and `TGA` (opal) are the three stop codons in the genetic code. When the ribosome encounters one of these stop codons during translation, it signals the end of the protein-coding region and initiates the process of translation termination.
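
In the printed table, `Stop` marks the stop codons and `(s)` marks codons that can act as start codons. Both sets are also available as attributes of the table object, as this short sketch shows.

```python
from Bio.Data import CodonTable

table = CodonTable.unambiguous_dna_by_name["Standard"]

# Codons that terminate translation
print("Stop codons: ", table.stop_codons)

# Codons marked with (s) in the printed table
print("Start codons:", table.start_codons)

# Stop codons have no amino acid, so they are absent from forward_table
print("TAA in forward_table:", "TAA" in table.forward_table)
```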

In [13]:
from Bio.Data import CodonTable

# Specify the genetic code (e.g., standard genetic code with ID 1)
genetic_code = CodonTable.unambiguous_dna_by_id[1]

# Get the mapping between codons and amino acids
codon_amino_acid_mapping = genetic_code.forward_table

# Print the mapping
for codon, amino_acid in codon_amino_acid_mapping.items():
    print(f"Codon: {codon}, Amino Acid: {amino_acid}")

Codon: TTT, Amino Acid: F
Codon: TTC, Amino Acid: F
Codon: TTA, Amino Acid: L
Codon: TTG, Amino Acid: L
Codon: TCT, Amino Acid: S
Codon: TCC, Amino Acid: S
Codon: TCA, Amino Acid: S
Codon: TCG, Amino Acid: S
Codon: TAT, Amino Acid: Y
Codon: TAC, Amino Acid: Y
Codon: TGT, Amino Acid: C
Codon: TGC, Amino Acid: C
Codon: TGG, Amino Acid: W
Codon: CTT, Amino Acid: L
Codon: CTC, Amino Acid: L
Codon: CTA, Amino Acid: L
Codon: CTG, Amino Acid: L
Codon: CCT, Amino Acid: P
Codon: CCC, Amino Acid: P
Codon: CCA, Amino Acid: P
Codon: CCG, Amino Acid: P
Codon: CAT, Amino Acid: H
Codon: CAC, Amino Acid: H
Codon: CAA, Amino Acid: Q
Codon: CAG, Amino Acid: Q
Codon: CGT, Amino Acid: R
Codon: CGC, Amino Acid: R
Codon: CGA, Amino Acid: R
Codon: CGG, Amino Acid: R
Codon: ATT, Amino Acid: I
Codon: ATC, Amino Acid: I
Codon: ATA, Amino Acid: I
Codon: ATG, Amino Acid: M
Codon: ACT, Amino Acid: T
Codon: ACC, Amino Acid: T
Codon: ACA, Amino Acid: T
Codon: ACG, Amino Acid: T
Codon: AAT, Amino Acid: N
Codon: AAC, Amino Acid: N
...
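
To see how `forward_table` is used, the sketch below translates a short made-up DNA sequence codon by codon and compares the result with `Seq.translate`. The helper `translate_manually` is hypothetical and only handles unambiguous codons in the first reading frame.

```python
from Bio.Data import CodonTable
from Bio.Seq import Seq

table = CodonTable.unambiguous_dna_by_id[1]


def translate_manually(dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable_length, 3):
        codon = dna[i:i + 3]
        if codon in table.stop_codons:
            break
        protein.append(table.forward_table[codon])
    return "".join(protein)


dna = "ATGGCCATTGTAATGGGCCGCTGA"  # hypothetical coding sequence
manual = translate_manually(dna)
builtin = str(Seq(dna).translate(to_stop=True))
print(manual, builtin)
```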

# 3. Processing a PDB file

In the context of a PDB (Protein Data Bank) file and structural biology, a "chain" refers to a continuous sequence of amino acid or nucleotide residues in a biological macromolecule, such as a protein or a nucleic acid (DNA or RNA). In a PDB file, each chain is assigned a one-character chain identifier, typically an uppercase letter (A, B, C, etc.).

## 3.1 Simple PDB file

Here is a simple example of a PDB file with three chains, each containing a single residue (ALA, GLY, and SER) described by its four backbone atoms.

```
HEADER    SIMPLE PDB EXAMPLE                     8-MAR-2024
TITLE     DIPEPTIDE STRUCTURE                  

ATOM      1  N   ALA A   1      10.000  20.000  30.000  1.00  0.00
ATOM      2  CA  ALA A   1      11.000  19.000  30.000  1.00  0.00
ATOM      3  C   ALA A   1      12.000  19.000  31.000  1.00  0.00
ATOM      4  O   ALA A   1      12.000  18.000  32.000  1.00  0.00

ATOM      5  N   GLY B   2      14.000  22.000  30.000  1.00  0.00
ATOM      6  CA  GLY B   2      15.000  21.000  30.000  1.00  0.00
ATOM      7  C   GLY B   2      16.000  21.000  31.000  1.00  0.00
ATOM      8  O   GLY B   2      16.000  20.000  32.000  1.00  0.00

ATOM      9  N   SER C   3      18.000  24.000  30.000  1.00  0.00
ATOM     10  CA  SER C   3      19.000  23.000  30.000  1.00  0.00
ATOM     11  C   SER C   3      20.000  23.000  31.000  1.00  0.00
ATOM     12  O   SER C   3      20.000  22.000  32.000  1.00  0.00

TER
END
```

- HEADER: Provides general information about the structure.
- TITLE: Describes the content of the structure.
- ATOM: Provides atomic coordinates for each atom in the structure. The fields are the record name, atom serial number, atom name, residue name, chain identifier, residue sequence number, the X, Y, and Z coordinates, occupancy, and temperature factor.
- TER: Marks the end of a chain.
- END: Marks the end of the PDB file.

In this example, there are three chains (A, B, and C), each containing a single residue. The chains are distinguished by the chain identifier column. In full PDB files a TER record normally follows each chain; this simplified example has only one TER at the end.
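
ATOM records are fixed-width, so each field can be read by slicing the line at known column positions. The sketch below does this in plain Python for one line of the example. The function `parse_atom_line` and its dictionary keys are hypothetical, and the slices assume the standard PDB column layout; in practice `Bio.PDB` does this parsing for you.

```python
def parse_atom_line(line: str) -> dict:
    """Split one ATOM record into its fields using fixed column positions."""
    return {
        "record": line[0:6].strip(),     # record name, e.g. ATOM
        "serial": int(line[6:11]),       # atom serial number
        "name": line[12:16].strip(),     # atom name, e.g. CA
        "resname": line[17:20].strip(),  # residue name, e.g. ALA
        "chain": line[21],               # chain identifier
        "resseq": int(line[22:26]),      # residue sequence number
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "bfactor": float(line[60:66]),   # temperature factor
    }


line = "ATOM      2  CA  ALA A   1      11.000  19.000  30.000  1.00  0.00"
atom = parse_atom_line(line)
print(atom["name"], atom["resname"], atom["chain"], atom["x"], atom["y"], atom["z"])
```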

In [14]:
pdb_data = """
HEADER    SIMPLE PDB EXAMPLE                     8-MAR-2024
TITLE     DIPEPTIDE STRUCTURE                  

ATOM      1  N   ALA A   1      10.000  20.000  30.000  1.00  0.00
ATOM      2  CA  ALA A   1      11.000  19.000  30.000  1.00  0.00
ATOM      3  C   ALA A   1      12.000  19.000  31.000  1.00  0.00
ATOM      4  O   ALA A   1      12.000  18.000  32.000  1.00  0.00

ATOM      5  N   GLY B   2      14.000  22.000  30.000  1.00  0.00
ATOM      6  CA  GLY B   2      15.000  21.000  30.000  1.00  0.00
ATOM      7  C   GLY B   2      16.000  21.000  31.000  1.00  0.00
ATOM      8  O   GLY B   2      16.000  20.000  32.000  1.00  0.00

ATOM      9  N   SER C   3      18.000  24.000  30.000  1.00  0.00
ATOM     10  CA  SER C   3      19.000  23.000  30.000  1.00  0.00
ATOM     11  C   SER C   3      20.000  23.000  31.000  1.00  0.00
ATOM     12  O   SER C   3      20.000  22.000  32.000  1.00  0.00

TER
END
"""

# Specify the file path
file_path = "test.pdb"

# Write the PDB data to the file
with open(file_path, "w") as pdb_file:
    pdb_file.write(pdb_data)

print(f"PDB data written to {file_path}")

PDB data written to test.pdb


In [15]:
from Bio import PDB

# Load the PDB file
parser = PDB.PDBParser(QUIET=True)
structure = parser.get_structure("example_structure", "./test.pdb")

# Count the number of chains
num_chains = len(list(structure.get_chains()))

print(f"Number of chains: {num_chains}")

Number of chains: 3
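
Once parsed, a structure can be navigated as a hierarchy of model, chain, residue, and atom. The sketch below parses a shortened in-memory copy of the example (an assumption made so the snippet does not depend on `test.pdb`), indexes into the hierarchy, and uses the subtraction operator on two atoms, which returns the distance between them.

```python
from io import StringIO

from Bio.PDB import PDBParser

# Shortened copy of the example file: N and CA atoms of chains A and B
pdb_text = """\
ATOM      1  N   ALA A   1      10.000  20.000  30.000  1.00  0.00
ATOM      2  CA  ALA A   1      11.000  19.000  30.000  1.00  0.00
ATOM      5  N   GLY B   2      14.000  22.000  30.000  1.00  0.00
ATOM      6  CA  GLY B   2      15.000  21.000  30.000  1.00  0.00
END
"""

parser = PDBParser(QUIET=True)
structure = parser.get_structure("demo", StringIO(pdb_text))

model = structure[0]  # first (and only) model
chain_ids = [chain.id for chain in model]

# Index by chain id, residue number, and atom name
ca_a = model["A"][1]["CA"]
ca_b = model["B"][2]["CA"]

# Subtracting two atoms gives the distance between them
distance = ca_a - ca_b
print(chain_ids, round(float(distance), 3))
```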


## 3.2 Download `1xyz.pdb` from the PDB database

In [16]:
from Bio import PDB
from Bio.PDB import MMCIFParser, PDBIO

# PDB ID for the structure
# https://www.rcsb.org/structure/1xyz
# A COMMON PROTEIN FOLD AND SIMILAR ACTIVE SITE IN TWO DISTINCT FAMILIES OF BETA-GLYCANASES
pdb_id = "1XYZ" 

# Create a PDBList object
pdb_list = PDB.PDBList()

# Specify the download path
download_path = "./"

# Download the PDB file
pdb_file_path = pdb_list.retrieve_pdb_file(pdb_id, pdir=download_path)

print(f"PDB file downloaded to: {pdb_file_path}")
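# Recent Biopython versions download PDBx/mmCIF by default instead of the
# legacy PDB format, so the returned path points to a .cif file

# CIF file path (the file returned by retrieve_pdb_file above)
cif_file_path = pdb_file_path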

# PDB file path
pdb_file_path = "./1xyz.pdb"

# Parse the CIF file
cif_parser = MMCIFParser(QUIET=True)
structure_from_cif = cif_parser.get_structure("my_structure", cif_file_path)

# Save the structure from CIF to PDB format
pdbio = PDBIO()
pdbio.set_structure(structure_from_cif)
pdbio.save(pdb_file_path)

# Parse the PDB file
pdb_parser = PDB.PDBParser(QUIET=True)
structure_from_pdb = pdb_parser.get_structure("my_structure", pdb_file_path)

# Iterate through the structure to extract information
for model in structure_from_pdb:
    for chain in model:
        print(f"Chain: {chain.id}")
        
        for residue in chain:
            print(f"  Residue: {residue.resname} {residue.id[1]}")
            
            for atom in residue:
                print(f"    Atom: {atom.name} - Coordinates: {atom.coord}")

Structure exists: './1xyz.cif' 
PDB file downloaded to: ./1xyz.cif




Chain: A
  Residue: ASN 516
    Atom: N - Coordinates: [41.511 25.152 36.876]
    Atom: CA - Coordinates: [40.907 25.555 35.563]
    Atom: C - Coordinates: [39.684 24.707 35.106]
    Atom: O - Coordinates: [39.191 24.916 34.001]
    Atom: CB - Coordinates: [41.97  25.57  34.422]
    Atom: CG - Coordinates: [43.166 26.528 34.694]
    Atom: OD1 - Coordinates: [43.247 27.183 35.761]
    Atom: ND2 - Coordinates: [44.122 26.565 33.756]
  Residue: ALA 517
    Atom: N - Coordinates: [39.186 23.778 35.939]
    Atom: CA - Coordinates: [38.027 22.952 35.538]
    Atom: C - Coordinates: [36.73  23.719 35.746]
    Atom: O - Coordinates: [36.707 24.699 36.48 ]
    Atom: CB - Coordinates: [37.986 21.654 36.326]
  Residue: LEU 518
    Atom: N - Coordinates: [35.647 23.273 35.115]
    Atom: CA - Coordinates: [34.366 23.944 35.28 ]
    Atom: C - Coordinates: [33.984 24.033 36.768]
    Atom: O - Coordinates: [33.537 25.08  37.228]
    Atom: CB - Coordinates: [33.276 23.197 34.525]
    ...
    Atom: C - Coordinates: [23.554 28.59  81.163]
    Atom: O - Coordinates: [23.955 29.755 81.131]
    Atom: CB - Coordinates: [24.993 26.531 81.655]
    Atom: CG - Coordinates: [24.823 25.369 82.628]
    Atom: CD - Coordinates: [23.361 25.405 82.979]
  Residue: LYS 697
    Atom: N - Coordinates: [22.676 28.105 80.285]
    Atom: CA - Coordinates: [22.13  28.957 79.229]
    Atom: C - Coordinates: [21.305 30.094 79.822]
    Atom: O - Coordinates: [21.386 31.229 79.353]
    Atom: CB - Coordinates: [21.28  28.157 78.257]
    Atom: CG - Coordinates: [21.049 28.902 76.92 ]
    Atom: CD - Coordinates: [20.106 28.129 76.033]
    Atom: CE - Coordinates: [19.905 28.803 74.657]
    Atom: NZ - Coordinates: [18.918 28.046 73.852]
  Residue: SER 698
    Atom: N - Coordinates: [20.481 29.799 80.831]
    Atom: CA - Coordinates: [19.683 30.833 81.494]
    Atom: C - Coordinates: [20.57  31.938 82.137]
    Atom: O - Coordinates: [20.271 33.152 82.04 ]
    Atom: CB - Coordinates: [18.777 30.216 82.574]
 