Skip to content

Package to model peptide substrates bound to annotated protease structures and run promiscuity analysis

License

Notifications You must be signed in to change notification settings

rochoa85/Modelling-Protease-Substrates

Repository files navigation

Modelling and sampling of protease substrates

Package to model peptide substrates bound to annotated protease structures

  • From publication "An automated protocol for modelling peptide substrates to proteases"
  • BMC Bioinformatics, 2020
  • Authors: Rodrigo Ochoa, Mikhail Magnitov, Roman A. Laskowski, Pilar Cossio, Janet M. Thornton

Purpose

The goal of these scripts is to provide tools for modelling natural substrates bound to proteases with structural information available. The first script is focused on the modelling, which includes protocols for the modification of non-natural amino acids by natural counterparts, and the modelling of missing regions within the 8-mer peptide region that binds the protease binding site. In addition, a second script allows the modelling of any peptide based on the templates obtained from the first script, and the run of dynamic analysis and calculation of structural observables averages reported in the paper, including the accessible surface area (ASA) and the interface interaction energy.

Third-party tools required:

Three of the required tools can be installed from the source code or through creating a conda environment:

The BioPython and RDKit modules can be also installed directly from package repositories. Modeller can be installed freely after obtaining an academic license. However, with the following commands you can create a virtual environment with conda:

conda config --add channels salilab
conda create -c rdkit -n model-prot rdkit biopython matplotlib scipy pip modeller
source activate model-prot

For the other two packages, it is recommended to compile the source code:

DSSP requires to be compiled using the source code of the latest version. For the Rosetta functionalities, the recommended is to follow the installation instructions and take note of the Rosetta version that will be provided in the script.

Extra files and databases required:

To run the modelling script, a file and a database are required to perform the analysis. These are:

  1. components.cif (file with information of compounds and modified ligands in the PDB). The file can be downloaded from: ftp://ftp.wwpdb.org/pub/pdb/data/monomers/components.cif (272 MB)

NOTE: Put the file into the auxiliar folder of the code project

  1. MEROPS mySQL database (mySQL schema containing the information required to identify enzyme substrates). The database can be downloaded from: ftp://ftp.ebi.ac.uk/pub/databases/merops/current_release/meropsweb121.tar.gz (727 MB)

NOTE: To install the mysql database you can follow these instructions:

  • Enter your mySQL environmet and run the command:

mysql> CREATE DATABASE merops;

  • Then from the command line, import the schema as:

mysql -u username -p merops < merops.sql You can put here the username and the password after prompted in the terminal. The user, password and MEROPS database name should be provided in the script to run the analysis succesfully

How to run the modelling script

The basic command line to run the script is:

model_substrate_protease.py [-h] [-s STRUCTURE] [-f FAMILY] [-n MAX_SUBSTRATES] [-m MODE] [-t SIM_THRESHOLD] [-r ROSETTA] -u USER -p PASSWORD -d DATABASE

where the arguments are:

arguments:
  -h, --help         show this help message and exit
  -s STRUCTURE       Add a protease structure PDB id to see if it is possible
                     to model substrates
  -f FAMILY          Select the family of proteases used as reference to model
                     the substrates from 1) serine, 2) cysteine
  -n MAX_SUBSTRATES  Define the maximum number of substrates that can be
                     modelled per structure available
  -m MODE            Choose a mode to run the script from two options: 1)
                     ready, 2) complete. Ready are the peptide templates with
                     natural amino acids, and Complete are the peptides
                     containing at least one amino acid
  -t SIM_THRESHOLD   Similarity threshold to decide which amino acid will
                     replace the NNAA present in the template
  -r ROSETTA         Version of Rosetta that will be implemented
  -u USER            User to access the local installation of the MEROPS
                     database
  -p PASSWORD        Password to access the local installation of the MEROPS
                     database
  -d DATABASE        Name of the local installation of the MEROPS database

The required arguments are information to access the mySQL MEROP database. For the other parameters the script has default values, including a PDB of reference available in the annotated proteases, or just run the models for all the families and structures present in the dataset. However, please be aware of changing the Rosetta version through the flag or directly in the script by default.

Examples

Model a peptide based on a substrate composed of natural amino acids

The modelling of substrates bound to proteases has two main modes. One involves the selection of proteases bound to peptides composed of natural amino acids (called ready in the script). The second mode is when the substrate has one non-natural amino acids (NNAA) to modify. The following is a command example for the first case:

python3.5 model_substrate_protease.py -f serine -n 1 -m ready -u user -p password -d merops

In this case, we are allowing the modelling of only one substrate per structure available in the dataset bound to peptides with natural amino acids. For this script two families are available: serine and cysteine proteases. Here we selected serine as the reference family. Finally the MEROPS credentials are provided to look for reported substrates

After that, the models will be stored in the models/model_ready folder with the name [structure]\_[substrate sequence].pdb, where the structure is the PDB id and the substrate sequence is the peptide that was fully modelled using the protocol. A report of the model is provided in the following form:

pdb,pepTemplate,pepModel,chain,merops,old_aa,new_aa,pos_aa,uniprot
1smf,-CAKSI--,SCAKSIIG,I,S01.151,THR,ALA,3,O60256
...

Here the fragment CAKSI was modelled into the 8-mer peptide SCAKSIIG, which is part of the substrate protein with UniProtKB id O60256. The model was performed using the protease structure with PDB id 1smf that belongs to the MEROPS family S01.151 (trypsin). In addition, an amino acid from the original structure had to be changed by another one reported in the substrate, in this case a threonine by an alanine in the position 3 of the peptide.

Model a peptide based on a substrate containing one non-natural amino acid (NNAA)

For the second case, we require the modification of a NNAA for a natural amino acid based on a similarity threshold defined by the user (called complete in the script). It means that we can model a substrate depending on how similar we want the template and new amino acids be. The following is an example based on a particular PDB structure available in the dataset:

python3.5 model_substrate_protease.py -s 1tps -f serine -n 1 -m complete -t 0.4 -u user -p password -d merops

Here we are modelling a substrate based on the template present in the structure with PDB id 1tps. The natural amino acid that will replace the present NNAA require to be at least 40% similar based on the Tanimoto comparison of the amino acid side chains. We selected serine proteases as the reference family, and the MEROPS credentials are also provided to look for reported substrates.

After that, the model is stored in the models/model_complete folder with the name [structure]\_[substrate sequence]\_modelled.pdb, where the structure is the PDB id and the substrate sequence is the peptide that was fully modelled using the protocol. A report of the model is provided in the following form:

pdb,pepTemplate,pepModel,chain,merops,old_aa,new_aa,pos_aa,uniprot,sim
1tps,-LTREL--,FLTRELAE,B,S01.151,DLE,LEU,247,P23396,1.0

Here the fragment LTREL was modelled into the 8-mer peptide FLTRELAE, which is part of the substrate protein with UniProtKB id P23396. The model was performed using the protease structure with PDB id 1tps that belongs to the MEROPS family S01.151 (trypsin). In addition, a NNAA from the original structure had to be changed by another one reported in the substrate, in this case the residue DLE by an L-Leucine, with a similarity of 100%.

How to run the sampling script

After having modelled 8-mer peptides in protease reference structures, it is possible to call the second script for modelling any peptide of interest, run a simulation of the system using the backrub method from Rosetta and calculate structural descriptors from the trajectory. The basic command line to run the script is:

run_dynamic_proteases.py [-h] -p PATH -s SEQUENCE -c CHAIN [-r ROSETTA]

where the arguments are:

arguments:
  -h, --help   show this help message and exit
  -p PATH      Path of the structure that will be used to run the analysis
  -s SEQUENCE  Sequence of the peptide that will be modelled and sampled
  -c CHAIN     Chain identifier of the peptide in the structure
  -r ROSETTA   Version of Rosetta that will be implemented

The required arguments are the path of the model that we want to use as template, the sequence of the peptide that will be modelled, and the chain of the peptide in the structure of reference. Please be aware of changing the Rosetta version through the flag or directly in the script by default.

Run the sampling of a modelled peptide using Rosetta

To run the dynamic analysis of a peptide of reference, we can call the script as:

python3.5 run_dynamic_proteases.py -p models/model_ready/1smf_SCAKSIIG_modelled.pdb -s TGYHKLPR -c B

Here we provide the path of the modelled structure that will be used as reference, the new peptide sequence (TGYHKLPR), and the chain identifier of the peptide, which is B in this case. The modelled peptide is subjected to Backrub sampling for a defined number of Monte Carlo steps (this can be changed in the script), and the snapshots are used to calculate per amino acid in the peptide two structural observables: the accessible surface area using DSSP, and the interaction energy using the interface scoring of Rosetta. The new modelled peptide is stored in the route dynamic/models, and the observables per amino acid are stored in the folders dynamic/observables/asa_[peptide sequence] and dynamic/observables/energy_[peptide sequence].

The averages per peptide are reported in the file final_averages_[peptide sequence].txt, with the following format (example with peptide TGYHKLPR):

Amino_acid Position Average_ASA Average_Energy
THR	1	0.72	-2.57
GLY	2	0.45	-2.17
TYR	3	0.28	-6.23
HIS	4	0.014	-9.59
LYS	5	0.31	-5.67
LEU	6	0.28	-4.31
PRO	7	0.52	-2.12
ARG	8	1.0	-1.17

The third column represents the average ASA for each amino acid in the peptide, and the four column the average energy. The script can be embedded in a loop to calculate the observables for a set of peptides of interest, as in the case of the random libraries mentioned in the publication.

Support

In case the protocol is useful for other research projects and require some advice, please contact us to the email: rodrigo.ochoa@udea.edu.co

About

Package to model peptide substrates bound to annotated protease structures and run promiscuity analysis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Languages