GitHub - sambitmishra0628/AR-PRED_source: Predicting allosteric and active site residues in proteins with machine learning and protein sequence, structure and dynamics features

Active and Regulatory site PREDiction (AR-PRED)

AR-PRED is a prediction method to detect putative regulatory (allosteric) and active site residues in a given protein. The AR-PRED package contains separate models for predicting these two residue types: active and allosteric. Based on their requirement, the user can use either or both models to make predictions for a given protein.

Workflow

Instructions for Running AR-PRED

The code for AR-PRED is written in Perl and MatLab and tested on a 64-bit Linux machine (RHEL, release 6.8). It is designed to be run on a Linux machine of similar or higher configuration. A set of specific instructions is provided below to execute AR-PRED.

A. Software

Download and install the following software

Perl version 5 (https://www.activestate.com/products/activeperl/downloads/)
MatLab 2017 or later (https://www.mathworks.com/downloads/)
Naccess (http://wolf.bms.umist.ac.uk/naccess/)
Fpocket version 2 (http://fpocket.sourceforge.net/)
DSSP (https://swift.cmbi.umcn.nl/gv/dssp/)
CD-HIT (http://weizhongli-lab.org/cd-hit/download.php)
BLAST 2.7 and nr protein sequence database (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.7.1/)
Rate4Site (https://m.tau.ac.il/~itaymay/cp/rate4site.html)
Clustal Omega (http://www.clustal.org/omega/). Rename the executable to clustalomega.
Download and unzip the MAVEN source code (MAVEN_source.v121.zip) from https://sourceforge.net/projects/maven/files/Release%201.21/. Store all the source code files in a directory named ‘MAVEN’ (rename the default directory after unzipping from MAVEN_source.v121 to MAVEN).

B. Environment variables

Download the AR-PRED source code and unzip to a local directory on your Linux work station. Make sure all the AR-PRED code is under the directory named ‘AR-PRED_source’ (by default when you unzip, it should be unzipped into a folder with that name). Copy the MAVEN directory to the AR-PRED_source directory. Then set the required environment variables and paths using the following steps.

Open the .bashrc file in your home directory using any text editor and add the following lines to the file after excluding the quotes and any special instructions given in parenthesis.

export ARPRED_HOME='path to AR-PRED_source directory'

export MATLAB_PATH='path to matlab executable' (the binary/executable matlab file must be included in the path)

export PATH=$PATH:'path to BLAST 2.7 bin directory

export BLASTDB='path to BLAST nr database'

export NACESS_PATH='path to NACCESS executable/binary'

export PATH=$PATH:'path to naccess binary' (the binary file should be named as naccess)

export PATH=$PATH:'path to rate4site binary' (the binary executable should be named as rate4site)

export PATH=$PATH:'path to cd-hit binary' (the default cd-hit binary is named as cd-hit and it should be left unchanged)

export PATH=$PATH:'path to fpocket binary' (the default binary name is fpocket and it should be left unchanged)

export PATH=$PATH:'path to dssp' (the binary should be named as dssp)

C. Compiling MAVEN *.c codes with mex

We will need to compile the MAVEN *.c codes for generating Hessian and adjacency matrices before running AR-PRED. Here are the instructions.

a. Start Matlab
b. Within Matlab, navigate to the MAVEN directory inside the ARPRED_HOME path (as set above)
c. To compile the \*.c codes, we will need to set up mex (if it is not already installed). Type “doc mex” on the matlab command line for instructions on how to set up mex
d. Compile the c codes with the “mex adjacency.c” and “mex ANMHess.c”

D. Running AR-PRED

AR-PRED requires an all-atom PDB file as its input. The user should make sure that non-standard or modified amino acids (other than the 20 standard amino acids) are be replaced by their standard names in the PDB file before executing AR-PRED. Otherwise, AR-PRED will report error.

AR-PRED has separate prediction models for active site and allosteric site residues in a given protein chain. Following are the commands and instructions to run predictions with AR-Pred.

$ARPRED_HOME/master_script.pl pdbfile  chain outputdir  pred_type activesite_file`

-pdbfile: Name of the all-atom PDB file. Must be present inside the output directory.
-chain: Name of the chain in the PDB file to make predictions for
-outputdir: Name of the output directory. Must be created before running the master script.
-pred_type: 1 for active site predictions and 2 for allosteric site predictions
-activesite_file: Comma separated values (csv) file containing residue IDs in the PDB forming the active site. 
	
The residue IDs must be separated by commas and must correspond to the residue IDs in the PDB file. The file name must include the path to the file.

For predicting active site residues change directory into the `outputdir` (`cd outputdir`) and run the following command.

$ARPRED_HOME/master_script.pl pdbfile  chain outputdir  1

For predicting regulatory or allosteric site residues however, AR-PRED requires prior knowledge of active site residues. If you already know the residues which form the active site (or a part of the active site) in the chain of interest, then write the residue ids as comma separated values into a .csv (e.g., activesite_res.csv) file and pass the file as an argument to the master_script. In case of no prior knowledge of active site residues, the AR-PRED active site prediction tool or any other active site prediction software may be used. Use the following command to predict allosteric site residues.

$ARPRED_HOME/master_script.pl pdbfile  chain outputdir  2 activesite_res.csv

Tip: When running the master script from the output directory, make sure to include the path alongwith the dir name in the `outputdir` argument.

E. AR-PRED output

AR-PRED outputs its predictions into a .csv file: allostericsite_predictions.csv and activesite_predictions.csv for the allosteric and active site residues, respectively. Each file has two columns: column 1 lists the residue IDs in the PDB file and column 2 lists the weighted probability scores corresponding to each residue ID. Residues having higher scores are more likely to be active or allosteric site residues, depending upon the type of prediction made.

Note: Depending on the database size and the number of sequences used for calculation of residue conservation scores and due to the random sampling of perturbation points for calculations of dynamic flexibility and perturbation response, there may be minor inconsistency in the residue ranking and scoring by AR-PRED in independent runs of the same protein. However, we have usually observed that the ranking of the top 50 highly probably active or allosteric site residues usually remain the same. To obtain more consistent results, it is advisable to get the average predictions from independent runs of AR-PRED.

In case of any questions or suggestions, please feel free to contact Sambit Mishra (skmishra@iastate.edu). Please include the citation (Mishra, S. K., Kandoi, G., & Jernigan, R. L. (2019). Coupling Dynamics and Evolutionary Information with Structure to Identify Protein Regulatory and Functional Binding Sites. Proteins: Structure, Function, and Bioinformatics.) if you use AR-PRED for your work.

F. License

AR-PRED is licensed under LGPL.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
Residue_PDB_Index		Residue_PDB_Index
Shortest_Path_2_Activesite_Residues		Shortest_Path_2_Activesite_Residues
dfi_and_allosteric_response		dfi_and_allosteric_response
example		example
example_2		example_2
example_3		example_3
example_4		example_4
example_5		example_5
network_features		network_features
parse_activesite_file		parse_activesite_file
pdb_file_processing_scripts		pdb_file_processing_scripts
pocket_residues		pocket_residues
residue_msf		residue_msf
residueconservation		residueconservation
rf_models_active_site		rf_models_active_site
rf_models_allosteric_site		rf_models_allosteric_site
secondary-structure		secondary-structure
solvent_accessibility		solvent_accessibility
.gitattributes		.gitattributes
AminoAcidIdentity.m		AminoAcidIdentity.m
GetAAType.m		GetAAType.m
GetSequenceHydropathy.m		GetSequenceHydropathy.m
LGPL - license.txt		LGPL - license.txt
PredictActiveSiteResidues.m		PredictActiveSiteResidues.m
PredictAllostericSiteResidues.m		PredictAllostericSiteResidues.m
Readme.md		Readme.md
Running AR-PRED.txt		Running AR-PRED.txt
_config.yml		_config.yml
master_script.pl		master_script.pl
matlab_feature_calc_wrapper.m		matlab_feature_calc_wrapper.m

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Active and Regulatory site PREDiction (AR-PRED)

Workflow

Instructions for Running AR-PRED

A. Software

B. Environment variables

C. Compiling MAVEN *.c codes with mex

D. Running AR-PRED

E. AR-PRED output

F. License

About

Releases

Packages

Languages

sambitmishra0628/AR-PRED_source

Folders and files

Latest commit

History

Repository files navigation

Active and Regulatory site PREDiction (AR-PRED)

Workflow

Instructions for Running AR-PRED

A. Software

B. Environment variables

C. Compiling MAVEN *.c codes with mex

D. Running AR-PRED

E. AR-PRED output

F. License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages