Skip to content

structuralbioinformatics/SBILib

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Welcome to the StructuralBioInformatics pylib

The SBILib python library is a continuous work in progress that clusters functionalities to work with data derived from PDB, from creating a local PDB repository to do real Blast searches over PDB sequences (i.e. sequences of the really crystallized protein sections) to perform a simply split chains (separate the sections of a PDB file).

Requirements:

The library is not exempt of bugs (probably), and can not be considered a final version. This means that one must use it at its own risk. And, while new functionalities are bound to appear, others might disappear at any given moment. Any comments, complains or bug reports can be addressed in the corresponding sections.

SBILib Library Manual

This manual contains a short turorial split into 6 common Tasks.
Information on how to access more detailed help pages within python is provided at the end.
We have made available a list of slightly more complex scenarios on which the SBILib library may be applied. In Scenarios.md (https://github.com/structuralbioinformatics/SBI)

Dependencies

python3
scipy
numpy
DSSP
BLAST
CD-Hit

Installation:

SBILib has been distributed on PyPi and can be installed with pip.

pip install SBILib

This command will install the package locally.

It can also be directly cloned from our github repo:

git clone https://github.com/structuralbioinformatics/SBI

Alternatively it is also possible to navigate directly to the repository https://github.com/structuralbioinformatics/SBI and copy the folder manually.

Testing:

We have implemented unit testing. After installation these scripts can be run to ensure that the package is functioning as it should. These scripts are located in the directory "tests". Due to size limits we werent able to provide the database file needed for BLAST through PyPi. This file, will be automatically downloaded from https://www.ebi.ac.uk/uniprot/download-center once the unit tests are conducted. Later users may define their own Blast database file as outlined in the BLAST section of this tutorial. IMPORTANT before running ensure that external dependencies are properly configured as per TASK 2.2.

pip show SBILib | grep Location

# Location: /location/of/SBILib/locally

cd /location/of/SBILib/locally/SBILib/tests

python Test.py

#........('TEST_PDB1A3Q_A.66.pdb', 'TEST_PDB1A3Q_A.66.dssp')
#truncated chain!TEST_PDB1A3Q_A.66.dssp
#.('TEST_PDB2RAM_A.38.pdb', 'TEST_PDB2RAM_A.38.dssp')
#('TEST_PDB2RAM_B.26.pdb', 'TEST_PDB2RAM_B.26.dssp')
#('TEST_PDB1A3Q_A.30.pdb', 'TEST_PDB1A3Q_A.30.dssp')
#truncated chain!TEST_PDB1A3Q_A.30.dssp
#('TEST_PDB1A3Q_B.75.pdb', 'TEST_PDB1A3Q_B.75.dssp')
#truncated chain!TEST_PDB1A3Q_B.75.dssp
#..
#----------------------------------------------------------------------
#Ran 11 tests in 67.249s
#
#OK

TASK 1

1) loading in the appropriate modules

The following code demonstrates how a PDB file can be loaded:

The first step is to load in the necessary modules.
PDB link will allow us to grab a protein from the protein data bank. While the PDB module will allow us to do some basic tasks on the protein.


from SBILib.databases import PDBlink
from SBILib.structure import PDB

2) Fetching PDBs

In the next step we create an instance of the PDBlink class and use it to get a pdb from the databank with the coresponding accession code.
Here we have selected protein "1a3q".
Note that the 'path/to/pdb' will be saved in the variable 'path'. This can then be read in as a new PDB object.
If the pdb file is stored locally the address can be provided directly without the need to fetch it from the databank.


new = PDBlink()
path = new.get_PDB("1a3q")
protein1 = PDB(path)

Additionally predicted PDBs can also be fetched from the AlphaFold2 model database.


from SBILib.databases import AlphaFoldlink

alphaPath = AlphaFoldlink.download_pdb("P31946")

AlphModel = PDB(alphaPath)

3) Reading out chains in a protein

First I will load in a second protein this time providing the address directly, and then have a look at the chains contained within the protein.
We can look at exacty what chains are contained within the protein.
Here our protein contains 2 chains identified as A and B


protein2 = PDB(path/to/pdbfile.pdb)
protein2.chain_identifiers
# set(['A', 'B'])

4) Selecting other elements of PDB

Now that we have seen how to select Chains of a protein we can also select the amino acids.
We will renumerate the atoms and residues so that they start at 1.


ChainA = protein.get_chain_by_id("A")

ChainA.renumerate_residues(1)
ChainA.renumerate_atoms(1)

ChainA.aminoacids

This list all of the amino acids in the Chain and its atoms.
The selection can be more specific by selecting one amino acid from the list.
We can access the number (the position of the amino acid on the chain)
The type of amino acid (three letter code and single letter).
The coordinates of the O, C, CA and CB (if present)
The coordinates of the backbone atoms.


ChainA.aminoacids[0].number
#1
ChainA.aminoacids[0].type
#'GLY'
ChainA.aminoacids[0].single_letter
#G
ChainA.aminoacids[0].n 
#<AtomOfAminoAcid: [N, 1]:(-5.833, 72.589, 45.643)>
ChainA.aminoacids[0].o 
#<AtomOfAminoAcid: [O, 4]:(-5.436, 72.093, 42.983)>
ChainA.aminoacids[0].c
#<AtomOfAminoAcid: [C, 3]:(-5.850, 73.226, 43.203)>
ChainA.aminoacids[0].ca
#<AtomOfAminoAcid: [CA, 2]:(-5.902, 73.707, 44.647)>
ChainA.aminoacids[0].cb
#
ChainA.aminoacids[0].backbone_atoms
#[<AtomOfAminoAcid: [N, 1]:(-5.833, 72.589, 45.643)>, <AtomOfAminoAcid: [CA, 2]:(-5.902, 73.707, 44.647)>, <AtomOfAminoAcid: [C, 3]:(-5.850, 73.226, 43.203)>]



There are other features such as secondary structure and accessibility information which is only available once the dssp of the chain has been calculated (see Task 2).

Alternatively we can also accesss the atoms of the amino acid by their index.
We see that the first amino acid has 4 atoms
We see the name and number of the atom
We can then access the x,y,and z coordinate of the atom
We can also see the occupancy


len( ChainA.aminoacids[0].atoms)
#4
ChainA.aminoacids[0].atoms[0].name
#N
ChainA.aminoacids[0].atoms[0].number
#1
ChainA.aminoacids[0].atoms[0].x
#-5.833
ChainA.aminoacids[0].atoms[0].y
#72.589
ChainA.aminoacids[0].atoms[0].z
#45.643


5) Duplicating protein and fusing chains

We can duplicate the protein as a new object with the duplicate command.
Then if we want to we could fuse chains "A" and "B" into a new chain "A"
Note that you will not be able to fuse two different kinds of structural chains.


protein3 = protein2.duplicate()
protein3 = protein3.fuse_chains(["A","B"])
protein3.chain_identifiers
# set(["A"])

6) Adding a chain to a protein

Now let us imagine a scenario where we wanted to take chain A from protein2 and add it to protein1.
First we have to pull out the chain that we want (A) from protein2.
Then by checking the chain Identifiers in protein1 we can see that it already has a chain "A".
This means that we will have to change the identifier of the chain we wish to add. Lets make it "E".
First we will have to duplicate the chain, creating a new object (ChainE), so that we dont alter the chain Ids of protein2.
Now we can add the chain to protein1
If we check the identifiers again we will see that a new chain "E" has been added to the protein.


ChainA = protein2.get_chain_by_id("A")
protein1.chain_identifiers
# set(['A', 'C', 'B', 'D'])
ChainE = ChainA.duplicate()
ChainE.chain = "E"
protein1.add_chain(ChainE)
protein1.chain_identifiers
# set(['A', 'C', 'B', 'D', 'E'])

7) Removing chain(s) from a protein

Now we would like to delete this chain again:
First we set up a temporary empty PDB object and call it protein3.
Then we will define a list of the chain identifiers we wish to remove (in this case only "E")
First we loop through all of the chains ids in protein1.
If that protein is not in the deletion list we provided...
We add it to the newly created protein3.
We can then redifine protein1 to leave it without the unwanted chains.

 
protein3 = PDB()
deletion = ["E"]
for chain in protein1.chain_identifiers:
     if chain not in deletion:
             protein3.add_chain(protein1.get_chain_by_id(chain))
             if (protein1.get_chain_by_id(chain).chaintype == "P"):
             	protein3._has_prot = "TRUE"
             elif (protein1.get_chain_by_id(chain).chaintype == "N"):
             	protein3._has_nucl = "TRUE"

protein1 = protein3

8) Adding multiple chains to a protein

Addition of multiple chains can also be accomplished.
First we have to grab the second chain from protein2 and change its id like we did to ChainA.
Then we change the identifiers of both chains as we have done previously
Then we create a list of the chains that we wish to add
We then add this list of chains to protein1


ChainB = protein2.get_chain_by_id("B")
ChainF = ChainB.duplicate()
ChainF.chain = "F"
chainlist = [ChainE,ChainF]
protein1.add_chains(chainlist)
protein1.chain_identifiers
# set(['A','B','C','D','E','F'])

You can revert back to the original protein by repeating step 6 and adding "F" to the deletion list.

9) Reading out protein sequence (gapped and exact)

We can now have a look at the protein sequence
First we can have a look at the exact sequence of the protein without gaps indicated (not containing gaps of non-crystalized positions).
Then we look at the gapped version (default). Xs will be placed where the crystal sequence contains gaps.
We can also read out the sequence of each chain individually as a gapped sequence (as shown with ChainA)
We can also read out the non gapped sequence of each chain as shown

ChainA = protein1.get_chain_by_id("A")
protein1.FASTA_format(gapped=False)
# "protein sequence in FASTA format"
protein1.FASTA_format()
# "gapped protein sequence in FASTA format"
ChainA.gapped_protein_sequence
# "gapped protein sequence for Chain A only"
ChainA.protein_sequence
# "The exact sequence in the crysal without non crystalized positions indicated"

10) Getting DNA sequence

In addition to protein sequences the PDB may contain DNA
We can check if this is the case for protein1
In the case where this is true we can read out the sequences in Fasta format

protein1.has_nucleotide
# True
for chain in protein1.nucleotides:
	print(">" + chain.globalID + "\n" + chain.nucleotide_sequence())
# >"globalID"
# "Nucleotide Sequence"

11) Cleaning protein and writing PDBfile

We can clean the protein with the following command
We can then write the PDB of the protein out to a file name that we provide


protein1.clean()
protein1.write("protein1.pdb")

TASK 2

1) Dependencies

The following application of this SBILib library requires DSSP to be run

2) Configuration

In the configSBI.txt file are stored the paths to the executable necessary.
This file is located in SBILib/external and looks something like this.
Here only dssp is shown but the format holds for other dependencies (BLAST, HMMER etc.)


[dssp]
executable     = mkdssp
path           = /path/to/the/executable/folder

If a config file is already available it can be used as well.
This file can be added to the scripts with the following commands


import os 
os.environ['SBI_CONFIG_FILE'] = "path/to/new/configSBI.txt"

3) Calculating a chains secondary structure

First we load in a new protein and grab one of the chains as in the previous chapter
We can then calculate the secondary structure of the chain
After this is done the secondary structure can be printed out.


from SBILib.structure import PDB
protein = PDB("pdb1a3q.ent.gz")
ChainA = protein.get_chain_by_id("A")
ChainA.calculate_dssp()
ChainA.gapped_protein_secondary_structure
# "gapped secondary structure"

The secondary structure annotation follows the DSSP classification:

G = 3-turn helix (310 helix). Min length 3 residues.
H = 4-turn helix (α helix). Minimum length 4 residues.
I = 5-turn helix (Ď€ helix). Minimum length 5 residues.
T = hydrogen bonded turn (3, 4 or 5 turn)
E = extended strand in parallel and/or anti-parallel β-sheet conformation. Min length 2 residues.
B = residue in isolated β-bridge (single pair β-sheet hydrogen bond formation)
S = bend (the only non-hydrogen-bond based assignment).
C = coil (residues which are not in any of the above conformations).

4) printing out the entire proteins amino acid sequence and secondary structure


for chain in protein.proteins:
	chain.calculate_dssp()
	print(">" + chain.globalID + "\n" + chain.gapped_protein_sequence + "\n" + chain.gapped_protein_secondary_structure)

# >"globalID"
# "gapped amino acid sequence"
# "gapped secondary structure"


This Gives us an overview of the protein. We see the Amino Acid sequence and the DSSP predicted Secondary structure all together.

TASK 3

1) Set up a blast

In order to blast you must have a database
This is a multifasta file containing protein or DNA sequences depending on the blast type.
Reformating of the multifasta file will be done automatically once it is provided.
The blast object will now know that the blast is to be conducted on this database.
If the unittest was run after installation a database file will be located in the tests folder (see Testing section)


from SBILib.external.blast import BlastExe
blast = BlastExe(database = "path/to/database/file.fa")

2) execute blast

Once the blast is set up we need a query sequence to pass.
For this example we will blast chainA of protein1 against our database.
With these variables we can now execute the blast and save the to "blastresults".
The result is an instance of type "BlastResult".


querySequence = ChainA.protein_sequence
queryID = ChainA.globalID
blastresults = blast.execute_query_seq(sequenceID = queryID, sequence = querySequence)
type(blastresults)
# <class 'SBILib.external.blast.BlastResult.BlastResult'>

3) View results

There are several ways that we can view the results of the blast.
Firstly we can print out a compacted version of all of the results.
We could also focus to select points of interest (eg. hit_id and evalue).


blastresults.print_compacted_blast()
# query and target info, e-value, coverage etc. 

for hit in blastresults.get_hits():
	print(hit.sequenceID, " ", hit.e_value)
# list of all hits and their associated e-values.

The compacted blast output is divided into the following collumns:
Query ID, Query length, Target ID, Accession length, Identities, Positives, Gaps, e_value , Query sequence, Target sequence, Format Positions
The varying outputs here have the following meaning:

Query ID = ID of provided sequence
Query length = length of provided sequence
Target ID = id of the hit sequence
Accession length = length of the hit sequence
Identities = number of identical amino acids between the query and target
Positives = The number of amino acids that are either identical between the query and the subject sequence or have similar chemical properties
Gaps = number of gaps in the alignment
e_value = the expected number of hits expected to be seen by chance given the size of the blast database.
Query sequence = provided sequence
Target sequence = hit sequence
Format Positions = a single string in which fragment definitions will be represented as follows:
Query segment 1 start : Hit segment 1start, Query segment 1 end : Hit segment 1 end,Query segment 2 start : Hit segment 2 start, Query segment 2 end : Hit segment 2 end etc.

4) Filter results

If we want to evaluate whether the proteins fall within the Rost Curve twilight zone,
we can parse through the list and print it off similarly to how it was done above.
If we want to only consider hits passing the rost evaluation we can pass that as an argument to the "get_hits" command


for hit in blastresults.get_hits():
	print(hit.sequenceID, " ", hit.evaluate_Rost_twilight_zone())
# indication of wich proteins would pass the test

for hit in blastresults.get_hits(tz_type = "ID"):
	print(hit.sequenceID, " ", hit.evaluate_Rost_twilight_zone())
# only hits that pass the test will be shown


4) PIR

PIR formatted alignments are available for every hit.
We can select hit number 13 with the result parameter
We could add in the filters to grab the PIR of alll of the best results


blastresults.str_PIR(result = 13)
# PIR alignment
for hit in  blastresults.get_hits( evalue = 0, tz_type = "ID"):
	blastresults.str_PIR(result = hit._num_seq)
# All PIRs with hits that match the criteria given 

5) Find Results with matching secondary structure

The secondary structure of a protein plays a vital role in its function.
The SBILib package allows for the quick viewing of how secondary structure is represented in an alignment.
Are vital secondary structure element located in a gap resulting in a homolog with unequal function?
Here we demonstrate how this can be analyzed.


blastresults.str_PIR(result = 13)


#>P1;1A3Q_A
#sequence:1A3Q_A:2:.:285:.:.:.:.:.
#PYLVIVEQPKQRGFRFRYGCEGPSHGGLPGASSEKGRKTYPTVKICNYEGPAKIEVDLVT
#HSDPPRAHAHSLVGKQCSELGICAVSVGPKDMTAQFNNLGVLHVTKKNMMGTMIQKLQRQ
#RLRSRPQGLTEAEQRELEQEAKELKKVMDLSIVRLRFSAFL------RSLPLKPVISQPI
#HDSKSPGASNLKISRMDKTAGSVRGGDEVYLLCDKVQKDDIEVRFYEDDENGWQAFGDFS
#PTDVHKQYAIVFRTPPYHKMKIERPVTVFLQLKRKRGGDVSDSKQFTYYP*
#>P1;REL
#structureX:REL:16:AVIRE:289:AVIRE:.:.:.:.
#PYIEIFEQPRQRGTRFRYKCEGRSAGSIPGEHSTDNNKTFPSIQILNYFGKVKIRTTLVT
#KNEPYKPHPHDLVGKGCRD-GYYEAEFGPERQVLSFQNLGIQCVKKKDLKESISLRISK-
#--KINPFNVPEEQLHNIDE--------YDLNVVRLCFQAFLPDEHGNYTLALPPLISNPI
#YDNRAPNTAELRICRVNKNCGSVKGGDEIFLLCDKVQKDDIEVRFVLGN---WEAKGSFS
#QADVHRQVAIVFRTPPFLG-DITEPITVKMQLRRPSDQAVSEPVDFRYLP*


This wil output a PIR format alignment of the two proteins. Dashes in the target protein (second sequence of the PIR) indicate gaps in the alignment. We are interestsed in seeing what secondary structure elements (if any) may have gone missing as a result of these gaps.


string = blastresults.str_PIR(result = 13)
indices = [i for i, c in enumerate("".join(string.split(">")[2].split("\n")[2:]))if c == "-"]
i = 0    
string= ""
for char in ChainA.gapped_protein_secondary_structure:
	if i in indices:
		string = string + char
	else:
		string = string + "-"
	i+=1


lines = textwrap.wrap(string,60)
print("\n".join(lines))

#------------------------------------------------------------
#-------------------T---------------------------------------
#HHH-----------------HHHHHHHT--------------------------------
#--------------------------------------------------SSS-------
#----------------------------------------------------

Here we can see what secondary structures go missing as a result of the alignment.
To remind ourselves of the query proteins secondary structure lets look at it again:

lines = textwrap.wrap(ChainA.gapped_protein_secondary_structure, 60)
print("\n".join(lines))


#--EEEEEE-B-SSSB--EETTT-S-S----BS---TT-----EEEEET--
#SSEEEEEEEE-SSSS--B-SSEEEETTB-TTS-EEEEE-SS--EEE---EEEEE--
#TTTHHHHHHHHHHHHHTTTS-S---HHHHHHHHHHHHHHHTT--TTEEEEEEEEEE-
#xxxxxx-EE---EE---EEBTTSTTTS---EEEES-SEEETT---EEEEEESS--
#TTTEEEEEEE-SSS-EEEE-B--GGGEETTTEEEEE----S-TT-SS-EEEEEEEEETTT
#--B---EEEEEE-



We can see that for this Homolog according to our alignment half of a 4-turn helix is missing in our target protein.
Should this element be of functional significance we would have to continue our search for a more suitable homolog.

Task 4

1) Complex

For this exercise we will use the protein DNA complex that we used in the previous tasks.
In order to extract the interfaces from a protein complex we will first have to create an object of the complex class.
In createing a new complex class we need to provide the protein complex.
We can grab the individual chains of the complex straight from this class using the .pdb method if we wanted.


from SBILib.structure import Complex
from SBILib.structure import PDB
protein = PDB("pdb1a3q.ent.gz")
complex = Complex(protein)
complex.pdb
# PDB object


Here we have used the default definitions for contacts within the complex which are as follows.

Complex(pdb, biomolecule=False, PPI=True, PPI_type='cb', PPI_distance=12, PNI=True, PNI_type='min', PNI_distance=8, PHI=True, PHI_type='min', PHI_distance=6)

Protein Protein interaction:
type: beta carbons
distance: 12 Angstrom

Protein Nucleotide interaction:
type: minimum
distance: 8 Angstrom

Protein Heteroatom interaction:
type: minimum
distance: 6 Angstrom

However these can be altered:


complex = Complex(protein, PPI_type='ca', PPI_distance=14)

The available contact types are:
Protein Protein interaction:
minimum (the distance between the closest pairs of atoms of each residue)
alpha carbons (distance between the alpha carbons of the two residues)
beta carbons (distance between the beta carbons of the two residues)
geometric (distance between the geometric centers of the two residues)
backbone (distance between the center of the backbone atoms between the two residues)
Protein Nucleotide interaction:
minimum
geometric
backbone
Protein Heteroatom interaction:
minimum
geometric

2) PPI

The interfaces are already calculated and present within the Complex object.
The results of the Protein Protein Interfaces are stored in a list and accessible with the PPInterfaces method
We can count the number of interfaces produced by looking at the length of the list (here just 1)
To access the interfaces we can simply index the list
The results can quickly be summarized by converting to a string.
Alternatively we can enter the interface for more detailed info.
The interface contains information on all of the contacts in the form of objects of the contact class. We can find the amino acid identifiers of all of the contact residues by iterating through the interface object (here we simply selected the first element of the list because there are no more interfaces).
We can then print each amino acids position within its chain and the single letter code identifier as well as the coordinates for the alpha carbon of that amino acid.
(aminoacid1 is the first chain provided and aminoacid2 is the second chain/interactor)
We can then print the geometric distance between the two amino acids.

print(complex.PPInterfaces[0].toString())
#Chain1, Chain2, threshold_type(default cb), threshold distance. 

len(complex.PPInterfaces)
# 1

complex.PPInterfaces[0]
# Interface object

for contact in complex.PPInterfaces[0].contacts:
	print(contact.aminoacid1.identifier, contact.aminoacid1.single_letter, contact.aminoacid1.ca.coordinates)
	print(contact.aminoacid2.identifier, contact.aminoacid2.single_letter, contact.aminoacid2.ca.coordinates)
	print(contact.geometric_distance)
# complete list of the contacts, coordinates and distances in this interface

2) PNI

We can calculate Protein Nucleotide interfaces as well.
The interfaces are accessible with the PNInterfaces method.
When we take a look at the interface list we see that this time we have multiple results.
These can be accessed by indexing or iterated through.
We can then gather the position and identifiers for the amino acid and nucleotides. As well as the alpha carbon coordinate in the case of the amino acid and the phosphate coordinate in the case of the nucleotide.

len(complex.PNInterfaces)
# 4 

for contact in complex.PNInterfaces[0].contacts:
	print(contact.nucleotide.identifier, contact.nucleotide.single_letter, contact.nucleotide.p.coordinates)
	print(contact.aminoacid.identifier, contact.aminoacid.single_letter, contact.aminoacid.ca.coordinates)
	print(contact.geometric_distance)
# complete list of the contacts, coordinates and distances in this interface

3) PHI

In order to find the Protein Heteroatoms interface you must first check whether your chains contains heteroatoms.
This can be checked with the has_heteroatoms method from the chain class.
I will load in a pdb file that I know has heteroatoms
The PHInterface will mix the residue types between the two chains.
For this reason we will find one that is of type aminoacid so that we can check the type within the contact object.
Now that we know which is which we can control for the classes in the iterations.
In the case of an amino acid residue we can print the same information from the previous examples.
In the case that the residue is a heteroatom we will print the identifier, all of the residues atoms (which contain the coordinats as well), we can also print the coordinates seperately with the all_coordinates method.

for chain in protein.chains:
	print(chain.has_heteroatoms)
# False
# False
# False
# False


protein = PDB("1a2w.pdb")
for chain in protein.chains:
	print(chain.has_heteroatoms)
# True
# True

complex = Complex(protein)
len(complex.PHInterfaces)
# 1

complex.PHInterfaces[0].contacts
# list of contacts

type(list(complex.PHInterfaces[0].contacts)[0].aminoacid)
# type of the first residue

type(list(complex.PHInterfaces[0].contacts)[0].heteroatom)
# type of the second Residue which will be opposite of the first (heteroatom residue vs aminoacid)


for contact in complex.PHInterfaces[0].contacts:
	if (contact.aminoacid.mode == "ATOM"):
		print(contact.aminoacid.identifier, contact.aminoacid.single_letter, contact.aminoacid.ca.coordinates)
		print(contact.heteroatom.identifier, contact.heteroatom.atoms, contact.heteroatom.all_coordinates)
		print(contact.geometric_distance)
	else:
		print(contact.aminoacid.identifier, contact.aminoacid.atoms, contact.aminoacid.all_coordinates)
		print(contact.heteroatom.identifier, contact.heteroatom.single_letter, contact.heteroatom.ca.coordinates)
		print(contact.geometric_distance)
# complete list of the contacts, coordinates and distances in this interface


Task 5

Protein loop geometry

We will load in the pdb we have worked with before and have a look at the loop geometry in its chain "A".


protein = PDB("pdb1a3q.ent.gz")
ChainA = protein.get_chain_by_id("A")

In order to visualize the loop geometry we need to first have an idea of the proteins secondary structure.
Like before this is done with DSSP and can be called within the SBILib library.
Once the secondary structures are resolved we can calculate the loop geometry.
In the SBILib library loops are refered to as archs.


ChainA.calculate_dssp()
ChainA.calculate_archs()
ChainA.archs

#[**Secondary Structure Relation:
#STR1:	( E )  39  <--  6 -->  44 
#	f11:  43  cmf11:              [-3.84483333 62.7425     28.42483333] eigf11:              [ 0.34916202 -0.59272239 -0.72578651]
#STR2:	( E )  54  <--  2 -->  55 
#	f44:  55  cmf44:              [ 7.85366667 71.586       7.988     ] eigf44:              [ 0.04348626  0.87377397 -0.48438414]
#ARCH TYPE: BK
#INTERNAL SS: 0
#DISTANCE: 22.978424484459143
#THETA: 98.69428 RHO: 74.28570 DELTA: 58.26752
#**



The output of the archs command is a list of the loop geometries calculated for the given chain (ChainA)
We get the first secondary structure that flanks the loop. In parantheses the structure type (see Task2.3) it's start, length and end.
An axis for flanking secondary structures 1 defined based on the shortest of the principal moments of inertia for that structures.
The same information is printed out for the second flanking structure.
The arch Type is printed:
BK: beta-link
BN: beta-hairpin
EG: beta-helix310
EH: beta-alpha helix
GE: helix310-beta
GG: helix310-helix310
GH: helix310-alpha helix
HE: alpha helix-beta
HG: alpha helix-helix310
HH: alpha helix-alpha helix

INTERNAL SS = the number of secondary structures contained in the loop.
Distance = The euclidean distance between the beginning and end of the loop.
Theta: the angle between the axes of secondary structure 1 and 2.
Delta: the angle between the axis of secondary structure 1 and the vector of distance between beginning and end of loop.
Rho: the angle between the secondary structure 2 axis and the plane that contains the axis of secondary structure 1.

To get larger loops which may contain secondary structures:


ChainA.superarchs

The output follows the same format as the previous.

Task 6

Loop Grafting

We will show how SBILib may be used to graft the loop of one protein into the structure of another protein with similiar loop geometry.


new = PDBlink()
path = new.get_PDB("2ram")
protein2 = PDB(path)
report = protein.compare_loops(self.protein2)
report[0]

# 'QueryProt:A,176,182:TargetProt:A,160,166:VISQPIH,VLSHPIF'

These Commands took the protein previously defined (1a3q here QueryProt) and for each loop within that protein searched for loops in the TargetProt (2ram) with matching geometries. These pairs are saved in a list (here defined by the variable report). Each element (loop matches) of that list is composed of a string with the format seen above. QueryProt:A refers to the first loop (from the query protein) being located on chain A. 176,182 the start and end positions of that loop respectively. TargetProt:A the second loop is found on chain A of the target protein. 160,166 our start and end positions. VISQPIH,VLSHPIF are the sequences for the Query protein and Target protein loops respectively. We define loop geometries to be similar if all of the following conditions are met:

Rho = math.sqrt((loop1.rho - loop2.rho)**2) <= 10 
Delta = math.sqrt((loop1.delta - loop2.delta)**2) <= 5
Theta = math.sqrt((loop1.theta - loop2.theta)**2) <= 5
Distance = math.sqrt((loop1.cartesian_distance - loop2.cartesian_distance)**2) <= 0.5 

That is to say if the difference between each type of angle and distance is less than the thresholds of (10 Rho angle, 5 Delta angle, 5 Theta angle, 0.5 Cartesian distance)

Now we tell the protein to graft the target protein loop into the query protein using the first match found in the previous output list. Note that if the loops are of unequal length this will not work.


graft = self.protein.graft(self.protein2, report[0])

Details on classes

If you wish to check the functions available for any class, they are available with pytons dir method.
first create an instance of the class you wish to inspect.
then call the dir method on that instance. What follows is a list of all of the functions of the PDB class
to get more detailed information call the help methdod.


protein = PDB("pdb1a3q.ent.gz")
dir(protein)

#['FASTA_IDX', 'FASTA_format', 'IDX_format', 'PDB_format', '_COMPND', '_NMR', '_NMR_chains', '__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_apply_matrix', '_biomol_id', '_chain_id', '_chains', '_cif_file', '_get_chain_position_by_id', '_has_nucl', '_has_prot', '_header', '_pdb_file', '_read_PDB_file', '_read_mmCIF_file', '_write_PDB_file', 'add_chain', 'add_chains', 'all_models', 'apply_biomolecule_matrices', 'apply_symmetry_matrices', 'biomolecule_identifier', 'chain_exists', 'chain_identifiers', 'chains', 'clean', 'dehydrate', 'dump', 'duplicate', 'fuse_chains', 'get_chain_by_id', 'has_nucleotide', 'has_protein', 'header', 'id', 'is_NMR', 'is_all_ca', 'load', 'non_standard_chains', 'nucleotides', 'pdb_file', 'proteins', 'read', 'repeated_chain_ids', 'reposition', 'rotate', 'tmpclean', 'translate', 'write']

help(protein)

This is the result:

PDB

class PDB(SBILib.beans.StorableObject.StorableObject)
A {PDB} is a collection of {Chain}

Methods defined here:

| FASTA_IDX(self, protein=True, nucleotide=False)
|
| FASTA_format(self, gapped=True, protein=True, nucleotide=False)
|
| IDX_format(self, protein=True, nucleotide=False)
|
| PDB_format(self, clean=False, terminal=True)
| Strings a {PDB} in PDB format
| @rtype: String
|
| init(self, pdb_file=None, dehydrate=False, header=False, onlyheader=False, biomolecule=False)
| @type pdb_file: String
| @param pdb_file: PDB formated file to read
|
| @raise IOError if pdb_file does not exist and it is not an empty object
| len(self)
| # OVERRIDE DEFAULT METHODS
|
| add_chain(self, chain, NMR=False)
| Adds a new chain to the PDB
|
| add_chains(self, chains, NMR=False)
| Adds a new chains to the PDB
|
| apply_biomolecule_matrices(self, keepchains=False, water=True)
| Only works if the PDB file is an original PDB file or
| the matrices have been added in the correct PDB format
| @rtype: {PDB}
|
| apply_symmetry_matrices(self)
| Only works if the PDB file is an original PDB file
| or the matrices have been added in the correct PDB format
| @rtype: {PDB}
|
| chain_exists(self, chain)
| Confirms if a given chain exists in the PDB
| @rtype: Boolean
|
| clean(self)
|
| dehydrate(self)
| # METHODS
|
| duplicate(self, hetero=True, water=False, NMR=False)
| Returns a {PDB} identical to the original but as a new object
| @rtype: {PDB}
| fuse_chains(self, chains_ids)
| Fuses several chains into the first one.
| It will not allow to fuse different structural chains.
| It does not alter the {PDB}, but provides a new one
| @rtype: {Chain}
|
| @raise AttributeError if:
| a) A given chain ID is not present
| b) Try to fuse different structural chains
|
| get_chain_by_id(self, id)
| Returns a chain according to its id or None if no chain with that id is found
| @rtype: {Chain}
|
| reposition(self, matrix=None, vector=None)
| Rotates and Translates each {Chain} according to a matrix and a translational vector
|
| @type matrix: numpy.matrix
|
| @type vector: numpy.array
|
| rotate(self, matrix=None)
| Rotates each {Chain} according to a given matrix
|
| @type matrix: numpy.matrix
|
| tmpclean(self, cluster_by_alternative_id=False)
| Makes a clean version of the PDB, rechaining in order and renumerating atoms.
| Renumbering residues is optional
|
| translate(self, vector=None)
| Translates each {Chain} according to a translational vector
|
| @type vector: numpy.array
|
| write(self, output_file=None, format='PDB', force=False, clean=False)
| Writes the object in a specific format
|
| @type output_file: String
| @param output_file: File to write
|
| @type format: String
| @param format: Format of the file to print
|
| ----------------------------------------------------------------------
| Static methods defined here:
|
| read(input_file, format='PDB')
| Reads a file of data in a specific format and returns the object
|
| @type input_file: String
| @param input_file: File to read
|
| @type format: String
| @param format: Format of the file to read
|
| ----------------------------------------------------------------------