Accurate and Fast Prediction of Intrinsically Disordered Protein by Multiple Protein Language Models and Ensemble Learning
This is the official implementation of the paper "Accurate and Fast Prediction of Intrinsically Disordered Protein by Multiple Protein Language Models and Ensemble Learning".
Our IDP-ELM server is now available! Access IDP-ELM
conda create -n idpelm python=3.10
conda activate idpelm
pip install -r requirements.txt
Please also download all trained checkpoints from here and decompress them into weights/.
- It is strongly recommended to run the code on a machine with more than 70 GB of RAM; otherwise, you may run out of memory.
- If you are using a Windows system, it is recommended to use start_docker.sh to create a Docker container.
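The memory recommendation above can be checked before launching a long run. The following is a minimal sketch, assuming a POSIX system (for example Linux, or inside the Docker container) where `os.sysconf` exposes `SC_PAGE_SIZE` and `SC_PHYS_PAGES`. The helper names `total_ram_gb` and `has_enough_ram` are hypothetical and are not part of this repository.

```python
import os


def total_ram_gb():
    """Return total physical memory in GiB.

    Assumption: a POSIX platform; os.sysconf is not available on native Windows.
    """
    page_size = os.sysconf("SC_PAGE_SIZE")   # bytes per memory page
    n_pages = os.sysconf("SC_PHYS_PAGES")    # number of physical pages
    return page_size * n_pages / 1024 ** 3


def has_enough_ram(threshold_gb=70.0):
    """Return True if the machine has more than threshold_gb of RAM.

    The 70 GB default mirrors the recommendation in this README.
    """
    return total_ram_gb() > threshold_gb


if __name__ == "__main__":
    print(f"Total RAM: {total_ram_gb():.1f} GiB")
    if not has_enough_ram():
        print("Warning: less RAM than recommended; memory issues are possible.")
```

The check only reports total installed memory, not memory that is currently free, so it is a coarse guard rather than a guarantee.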
To reproduce the results in the paper, please follow the steps below:
- Run ./train.sh.
- After running, you have the following options:
  - [0] Encode sequences into high-dimensional representations. This step must be completed before training any model on the datasets, and it accelerates the training process.
  - [1] Train the secondary structure predictor, which is used as a module in IDP-ELM.
  - [3, 7, 11] Train IDP-ELM and the IDR function predictors.
  - Other options are explained in the script.
- IDP datasets
- DFL datasets
- DP datasets
- From DeepDISOBind
- Training set: data/DeepDISOBind/TrainingDataset.txt
- Test set: data/DeepDISOBind/TestDataset.txt
- Secondary structure datasets
- From NetSurfP-3.0
- Training set: data/NetSurfP-3.0/Train_HHblits.txt
- Test set:
- CASP12: data/NetSurfP-3.0/CASP12_HHblits.txt
- CB513: data/NetSurfP-3.0/CB513_HHblits.txt
- TS115: data/NetSurfP-3.0/TS115_HHblits.txt
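Before training, it can help to confirm that the dataset files listed above are in place. The following is a minimal sketch, assuming it is run from the repository root; the file paths are copied from the list above, while the names `EXPECTED_FILES` and `find_missing` are hypothetical and not part of this repository.

```python
from pathlib import Path

# Paths copied from the dataset list in this README, relative to the repository root.
EXPECTED_FILES = [
    "data/DeepDISOBind/TrainingDataset.txt",
    "data/DeepDISOBind/TestDataset.txt",
    "data/NetSurfP-3.0/Train_HHblits.txt",
    "data/NetSurfP-3.0/CASP12_HHblits.txt",
    "data/NetSurfP-3.0/CB513_HHblits.txt",
    "data/NetSurfP-3.0/TS115_HHblits.txt",
]


def find_missing(root=".", expected=EXPECTED_FILES):
    """Return the expected paths that do not exist as files under root."""
    root = Path(root)
    return [p for p in expected if not (root / p).is_file()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing dataset files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All listed dataset files were found.")
```

The script only checks that the files exist; it does not validate their contents or format.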
If you have any questions, please contact me at shijie.xu@ees.hokudai.ac.jp.