GraphKM: machine and deep learning for KM prediction of wildtype and mutant enzymes

Introduction

GraphKM is a Python toolbox for predicting KM values of wildtype and mutant enzymes.

Requirements

The instructions below assume that you use Miniconda or Anaconda. In a terminal, execute:

conda create -n GraphKM python=3.8
conda activate GraphKM

Required packages:

paddlehelix==1.0.1
pgl==2.2.4
paddlepaddle-gpu==2.3.2
matplotlib
scikit-learn
rdkit
PubChemPy
xgboost==1.7.5
hyperopt==0.2.7
ESM

Note: paddlepaddle-gpu==2.3.2 can be installed from the command line:

conda install paddlepaddle-gpu==2.3.2 cudatoolkit=11.2 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge

For ESM installation, please refer to this GitHub site.
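Because several packages are pinned to exact versions, it can help to confirm the active environment matches before training. The following is a minimal sketch, not part of GraphKM: the helper name `check_pins` is hypothetical, and it assumes the distribution names reported by `importlib.metadata` match the names in the list above, which may not hold for every conda install.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list above.
PINNED = {
    "paddlehelix": "1.0.1",
    "pgl": "2.2.4",
    "paddlepaddle-gpu": "2.3.2",
    "xgboost": "1.7.5",
    "hyperopt": "0.2.7",
}


def check_pins(pinned, get_version=version):
    """Return {package: (expected, found)} for every mismatch.

    `found` is None when the package is not installed.
    `get_version` is injectable so the check can be tested offline.
    """
    problems = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package was found at the expected version.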

Input files

Before data preprocessing, a JSON file and a CSV file should be ready. The CSV file of sequence embeddings is generated from the JSON file by KM_data_clean/generate_esm_vector_gpu.py. Run the following command:

python generate_esm_vector_gpu.py -i my_data.json -o sequences_embeddings.csv

Train

Preprocess

python data_preprocess.py -i my_data.json -l KM -input_seq my_protein_sequences_embeddings.csv -o my_dataset.npz
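After preprocessing, a quick look inside the output file confirms that something was written. The sketch below assumes that data_preprocess.py writes a standard NumPy .npz archive, which is a zip file with one `.npy` member per array. The function name `list_npz_arrays` is hypothetical, and the array names it reports are whatever the script stored; none are assumed here.

```python
import zipfile


def list_npz_arrays(npz_path):
    """Return the array names stored in a NumPy .npz archive.

    A .npz file is a zip archive with one `<name>.npy` member per array,
    so the names can be read without loading any data into memory.
    """
    with zipfile.ZipFile(npz_path) as archive:
        return sorted(
            member[: -len(".npy")]
            for member in archive.namelist()
            if member.endswith(".npy")
        )


if __name__ == "__main__":
    import sys

    # Usage: python inspect_npz.py my_dataset.npz
    for name in list_npz_arrays(sys.argv[1]):
        print(name)
```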

Training

Training requires a large amount of GPU memory if you use a GPU for acceleration. A GPU with 24 GB of memory is recommended.

python train.py -d path_to/my_dataset.npz --model_config path_to/gin_config.json -l KM --model_dir path_to/ --results_dir path_to/

python train_xgb.py -i path_to/my_data.json -l KM -input_seq path_to/my_protein_sequences_embeddings.csv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json
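The two commands above run in order: train_xgb.py takes a .pdparams checkpoint through `-m`. A minimal sketch of chaining them from Python follows. It uses only the flags shown above; the function names are hypothetical, and it assumes that train.py writes the checkpoint into the `--model_dir` directory under the file name used in the example, and that both scripts are in the current working directory.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(data_json, dataset_npz, embeddings_csv, config_json, out_dir,
                   checkpoint_name="best_model_gin_-1_lr0.0005.pdparams"):
    """Build the argument lists for the two training stages."""
    out_dir = Path(out_dir)
    # Stage 1: train the graph neural network.
    train = [
        sys.executable, "train.py",
        "-d", str(dataset_npz),
        "--model_config", str(config_json),
        "-l", "KM",
        "--model_dir", str(out_dir),
        "--results_dir", str(out_dir),
    ]
    # Stage 2: train XGBoost, passing the stage 1 checkpoint via -m.
    # Assumption: the checkpoint is written to out_dir under checkpoint_name.
    train_xgb = [
        sys.executable, "train_xgb.py",
        "-i", str(data_json),
        "-l", "KM",
        "-input_seq", str(embeddings_csv),
        "-m", str(out_dir / checkpoint_name),
        "--model_config", str(config_json),
    ]
    return [train, train_xgb]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands(...))` stops before the XGBoost stage if train.py exits with an error, so a missing checkpoint is not passed on silently.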

Training results

| Method | MSE | RMSE | R² |
| --- | --- | --- | --- |
| GIN-based | 0.639 | 0.799 | 0.614 |
| GAT-based | 0.709 | 0.842 | 0.572 |
| GCN-based | 0.671 | 0.819 | 0.595 |
| GAT_GCN-based | 0.627 | 0.792 | 0.622 |
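The root-mean-square error column is the square root of the MSE column, which gives a quick consistency check on the table. The short sketch below uses only the numbers reported above; the variable and function names are illustrative.

```python
import math

# (MSE, RMSE, R2) values copied from the results table above.
RESULTS = {
    "GIN-based": (0.639, 0.799, 0.614),
    "GAT-based": (0.709, 0.842, 0.572),
    "GCN-based": (0.671, 0.819, 0.595),
    "GAT_GCN-based": (0.627, 0.792, 0.622),
}


def rmse_matches_mse(mse, rmse, digits=3):
    """True when sqrt(MSE), rounded to `digits` places, equals the reported RMSE."""
    return abs(round(math.sqrt(mse), digits) - rmse) < 1e-9


# Every row should be internally consistent.
consistent = all(rmse_matches_mse(mse, rmse) for mse, rmse, _ in RESULTS.values())

# The model with the lowest MSE in the table.
best = min(RESULTS, key=lambda name: RESULTS[name][0])
```

With the reported values, every row is consistent and the GAT_GCN-based model has the lowest MSE.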

Note: The trained models are available in the Figshare database with DOI: 10.6084/m9.figshare.25335049.

Prediction

The input for prediction.py:

To predict KM values for different sequences paired with different substrate SMILES codes, use a CSV file as input. For the CSV format, refer to the example.csv file. Example command line for prediction:
```
python prediction.py -c --csv_file example.csv -l KM -input_seq example.tsv -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config gin_config.json -xgb path_to/gin_xgboost_model.dat
```

To predict KM values for different sequences paired with a single substrate SMILES code, use a FASTA file as input.

Example command line for prediction:

python prediction.py -l KM -f --fasta_file example.fasta -input_seq my_sequences_embeddings.tsv -S substrate.txt -m path_to/best_model_gin_-1_lr0.0005.pdparams --model_config path_to/gin_config.json -xgb path_to/gin_xgboost_model.dat
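Before running a FASTA-based prediction, it can be useful to confirm that the file parses and to see which sequences contain letters outside the 20 standard amino acids. The sketch below is a generic FASTA reader, not the parser used by prediction.py. The function names are hypothetical, and whether prediction.py accepts non-standard residues is not stated here, so the check is informational only.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(path):
    """Parse a FASTA file into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def nonstandard_residues(sequence):
    """Return the sorted letters in `sequence` outside the 20 standard amino acids."""
    return sorted(set(sequence.upper()) - STANDARD_AA)
```

For example, `[(h, nonstandard_residues(s)) for h, s in read_fasta("example.fasta")]` lists, per record, any letters that may need attention.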

Independent dataset

We manually collected an independent KM dataset (HXKm) from the literature. The HXKm dataset has been published in this journal.

Tip

Pass the -h flag to any script for more help.

python data_preprocess.py -h
python train.py -h
python train_xgb.py -h
python prediction.py -h

Citation

He, X., Yan, M. GraphKM: machine and deep learning for KM prediction of wildtype and mutant enzymes. BMC Bioinformatics 25, 135 (2024). https://doi.org/10.1186/s12859-024-05746-1
