<span style="font-size:18px"> DeepPTMPred: A Multi-Modal Deep Learning Framework for Accurate Prediction of Protein Post-Translational Modification Sites </span>

> This notebook provides a complete workflow for predicting protein PTM sites by integrating sequence information, ESM language-model features, and protein structural features.  
> It takes a PDB file as input, extracts the sequence and structural features, loads the matching ESM feature file, and outputs the predicted modification probability for each candidate residue of the selected PTM type.

## 🧰 Environment Setup and Dependency Installation

Install Required Dependencies:

```bash
git clone https://github.com/wanglabhku/DeepPTMPred
conda create -n ptm-env python=3.10 -y
conda activate ptm-env
conda install -c conda-forge cudatoolkit=11.8 cudnn
conda install -c pytorch pytorch torchvision torchaudio
pip install tensorflow==2.15
pip install tensorflow-addons
pip install biopython
pip install fair-esm
pip install scikit-learn==1.6.1
pip install imbalanced-learn==0.13.0
pip install matplotlib==3.10.3
pip install seaborn==0.13.2
pip install tqdm==4.67.1
pip install joblib==1.4.2
pip install logomaker==0.8.7
pip install pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()'
```

The file directory structure should be as follows:

```text
/DeepPTMPred/
├── data/
│   └── AF-P31749-F1-model_v4.pdb
├── pred/
│   ├── train_PTM/
│   │   ├── model/
│   │   │   └── models_phosphorylation_esm2/
│   │   │       └── ptm_data_210_39_64_best_model.h5
│   │   ├── predict.ipynb
│   │   └── predict.py
│   └── custom_esm/
│       └── P31749_full_esm.npz
└── results/
```

> ✅ Once you’ve completed the steps above, you can run this notebook directly to make predictions.
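
Before launching the notebook, it can help to confirm that the inputs from the directory layout above are in place. The following is a minimal sketch, not part of DeepPTMPred: the helper name `missing_inputs` is hypothetical, and the file-name patterns are simply copied from the example layout (AlphaFold `v4` model, `<protein_id>_full_esm.npz`).

```python
from pathlib import Path


def missing_inputs(project_root, protein_id):
    """Return the expected input files that are absent under project_root.

    Hypothetical helper: the paths mirror the directory layout shown in
    this notebook and are assumptions, not an official DeepPTMPred API.
    """
    root = Path(project_root)
    expected = [
        root / "data" / f"AF-{protein_id}-F1-model_v4.pdb",
        root / "pred" / "custom_esm" / f"{protein_id}_full_esm.npz",
        root / "pred" / "train_PTM" / "predict.py",
    ]
    return [path for path in expected if not path.is_file()]


# Example usage (run from the DeepPTMPred project root):
#     for path in missing_inputs(".", "P31749"):
#         print(f"missing: {path}")
```

An empty list means the three files are present; the model weights under `pred/train_PTM/model/` are not checked here because their file name depends on the PTM type.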

In [1]:
# Cell 1: Set the path and initialize the system
print("Initializing DeepPTMPred prediction system...")
import os
import sys
import pandas as pd
import tensorflow as tf
from tensorflow.keras import backend as K 

# --- Dynamic Path Determination ---
# Get the current working directory of the notebook.
# This assumes the notebook is launched from its directory:
# /some/path/DeepPTMPred/pred/train_PTM/
current_notebook_dir = os.getcwd() 

# Determine the DeepPTMPred project root by going up two levels from the notebook's directory
# From /pred/train_PTM -> /pred -> /DeepPTMPred
project_root = os.path.abspath(os.path.join(current_notebook_dir, '..', '..'))

print(f"Project root determined as: {project_root}")

# Add the 'pred/train_PTM' directory to sys.path using the determined root
sys.path.append(os.path.join(project_root, 'pred', 'train_PTM'))

# Import the full prediction module
try:
    from predict import (
        PredictConfig,
        PTMPredictor,
        extract_protein_id_from_pdb_path,
        extract_sequence_from_pdb
    )
    print("Successfully loaded prediction module")
except ImportError as e:
    print(f"Failed to load prediction module: {e}")
    print(f"Please ensure the path '{os.path.join(project_root, 'pred', 'train_PTM')}' is correct and 'predict.py' exists within it.")
    # Stop here: the later cells cannot run without the prediction module
    raise

print("System initialization complete!")

# Make project_root available for Cell 2
%store project_root 

Initializing DeepPTMPred prediction system...



Project root determined as: /home/qhjiang/DeepPTMPred



TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 



DeepPTMPred project root determined as: /home/qhjiang/DeepPTMPred
Successfully loaded prediction module
System initialization complete!
Stored 'project_root' (str)
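
Cell 1 assumes the notebook is launched from `pred/train_PTM/`, so that going up two levels reaches the project root. If the kernel starts elsewhere, that assumption breaks silently. One possible alternative, sketched below under the assumption that `pred/train_PTM/predict.py` uniquely marks the project, is to walk upward from the working directory until that file is found. The function name `find_project_root` is hypothetical and not part of the repository.

```python
from pathlib import Path


def find_project_root(start, marker=("pred", "train_PTM", "predict.py")):
    """Walk upward from `start` until a directory containing the marker file is found.

    Hypothetical helper: the marker path is an assumption based on the
    directory layout shown in this notebook.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each of its ancestors.
    for candidate in (start, *start.parents):
        if candidate.joinpath(*marker).is_file():
            return candidate
    raise FileNotFoundError(f"No directory at or above {start} contains {Path(*marker)}")


# Example usage inside the notebook:
#     project_root = str(find_project_root(Path.cwd()))
```

This works whether the notebook is started from the project root, from `pred/`, or from `pred/train_PTM/`, and it fails with an explicit error instead of a wrong path.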


In [2]:
# Cell 2: Create the Interactive Prediction System
# Retrieve project_root from Cell 1
%store -r project_root

class MultiPTMPredictor:
    def __init__(self):
        self.ptm_types = [
            'phosphorylation', 'ubiquitination','acetylation', 'hydroxylation',
            'gamma_carboxyglutamic_acid', 'lys_methylation', 'malonylation', 
            'arg_methylation', 'crotonylation', 'succinylation', 'glutathionylation',
            'sumoylation', 's_nitrosylation', 'glutarylation', 'citrullination',
            'o_linked_glycosylation', 'n_linked_glycosylation'
        ]
        
        self.ptm_descriptions = {
            'phosphorylation': 'Phosphorylation (S, T)',
            'ubiquitination': 'Ubiquitination (K)',
            'acetylation': 'Acetylation (K)', 
            'hydroxylation': 'Hydroxylation (P)',
            'gamma_carboxyglutamic_acid': 'γ-Carboxyglutamic Acid (E)',
            'lys_methylation': 'Lysine Methylation (K)',
            'malonylation': 'Malonylation (K)',
            'arg_methylation': 'Arginine Methylation (R)',
            'crotonylation': 'Crotonylation (K)',
            'succinylation': 'Succinylation (K)',
            'glutathionylation': 'Glutathionylation (C)',
            'sumoylation': 'SUMOylation (K)',
            's_nitrosylation': 'S-Nitrosylation (C)',
            'glutarylation': 'Glutarylation (K)',
            'citrullination': 'Citrullination (R)',
            'o_linked_glycosylation': 'O-linked Glycosylation (S, T)',
            'n_linked_glycosylation': 'N-linked Glycosylation (N)'
        }
    
    def get_user_input(self):
        """Get user input"""
        print("\n" + "="*50)
        print("DeepPTMPred - Multi-PTM Type Prediction")
        print("="*50)
        
        # Display PTM type options
        print("\nPlease select a PTM type:")
        for i, ptm_type in enumerate(self.ptm_types, 1):
            print(f"{i:2d}. {self.ptm_descriptions[ptm_type]}")
        
        # PTM type selection
        while True:
            try:
                choice = input(f"\nSelect PTM type (1-{len(self.ptm_types)}, default 1): ").strip()
                if not choice:
                    choice = 1
                else:
                    choice = int(choice)
                
                if 1 <= choice <= len(self.ptm_types):
                    ptm_type = self.ptm_types[choice-1]
                    break
                else:
                    print(f"Please enter a number between 1 and {len(self.ptm_types)}")
            except ValueError:
                print("Please enter a valid number")
        
        # Protein ID input
        protein_id = input("\nEnter protein ID (e.g., P31749): ").strip()
        while not protein_id:
            protein_id = input("Protein ID cannot be empty, please re-enter: ").strip()
        
        # Sites of interest
        sites_input = input("Sites of interest (comma-separated, e.g., 129,308,473; press Enter to skip): ").strip()
        sites_of_interest = []
        if sites_input:
            try:
                sites_of_interest = [int(x.strip()) for x in sites_input.split(',')]
            except ValueError:
                print("Invalid site format; no sites of interest will be highlighted")
        
        return ptm_type, protein_id, sites_of_interest
    
    def validate_files(self, protein_id, ptm_type):
        """Validate required files"""
        # Construct paths relative to project_root
        pdb_path = os.path.join(project_root, 'data', f'AF-{protein_id}-F1-model_v4.pdb')
        esm_path = os.path.join(project_root, 'pred', 'custom_esm', f'{protein_id}_full_esm.npz')
        
        print("\nChecking required files...")
        for path, name in [(pdb_path, "PDB file"), (esm_path, "ESM feature file")]:
            if not os.path.exists(path):
                raise FileNotFoundError(f"{name} not found: {path}")
            print(f"✓ {name}: {os.path.basename(path)}")
        
        return pdb_path
    
    def run_prediction(self, ptm_type, protein_id, pdb_path):
        """Run prediction"""
        print(f"\nStarting prediction for {self.ptm_descriptions[ptm_type]} ...")
        
        # Create config and predictor
        config = PredictConfig(ptm_type=ptm_type, project_root=project_root)
        predictor = PTMPredictor(config)
        
        # Extract sequence
        protein_sequence = extract_sequence_from_pdb(pdb_path, chain_id="A")
        print(f"Sequence length: {len(protein_sequence)}")
        
        # Find target amino acid positions
        target_aa = config.target_aa
        target_positions = [i+1 for i, aa in enumerate(protein_sequence) if aa in target_aa]
        print(f"Found {len(target_positions)} {''.join(target_aa)} positions")
        
        # Run prediction
        print("Running model prediction...")
        results_df = predictor.predict_ptm_sites(
            protein_id, protein_sequence, target_positions, pdb_path=pdb_path
        )
        
        print("Prediction complete!")
        return results_df, protein_sequence, target_aa
    
    def display_results(self, results_df, ptm_type, protein_id, protein_sequence, target_aa, sites_of_interest):
        """Display results"""
        print("\n" + "="*50)
        print(f"{protein_id} - {self.ptm_descriptions[ptm_type]} Prediction Results")
        print("="*50)
        
        total = len(results_df)
        positive = len(results_df[results_df['prediction'] == 1])
        high_conf = len(results_df[results_df['probability'] > 0.6])
        # Guard against division by zero when no target residues were found
        positive_pct = positive / total * 100 if total else 0.0
        
        print(f"Target amino acids: {target_aa}")
        print(f"Total {''.join(target_aa)} positions: {total}")
        print(f"Predicted {ptm_type}: {positive} ({positive_pct:.1f}%)")
        print(f"High confidence (>0.6): {high_conf}")
        print(f"Highest probability: {results_df['probability'].max():.3f}")
        
        # High-confidence sites
        if high_conf > 0:
            print(f"\nHigh-confidence sites:")
            high_sites = results_df[results_df['probability'] > 0.6].nlargest(8, 'probability')
            for _, row in high_sites.iterrows():
                print(f"  Position {int(row['position']):3d} ({row['residue']}): {row['probability']:.3f}")
        
        # Sites of interest
        if sites_of_interest:
            print(f"\nSites of interest:")
            for pos in sites_of_interest:
                site_data = results_df[results_df['position'] == pos]
                if not site_data.empty:
                    prob = site_data['probability'].values[0]
                    pred = "Yes" if site_data['prediction'].values[0] == 1 else "No"
                    print(f"  Position {pos:3d} ({protein_sequence[pos-1]}): Probability={prob:.3f}, Prediction={pred}")
                else:
                    print(f"  Position {pos:3d}: Not a {''.join(target_aa)} residue")
        
        return total, positive
    
    def save_results(self, results_df, protein_id, ptm_type):
        """Save results"""
        output_dir = os.path.join(project_root, 'results')
        os.makedirs(output_dir, exist_ok=True)
        
        output_path = os.path.join(output_dir, f"{protein_id}_{ptm_type}_predictions.csv")
        results_df.to_csv(output_path, index=False)
        print(f"\nResults saved to: {output_path}")
        return output_path
    
    def start_prediction(self):
        """Start prediction workflow"""
        try:
            ptm_type, protein_id, sites_of_interest = self.get_user_input()
            pdb_path = self.validate_files(protein_id, ptm_type)
            print(f"\nParameters confirmed:")
            print(f"  PTM type: {self.ptm_descriptions[ptm_type]}")
            print(f"  Protein: {protein_id}")
            if sites_of_interest:
                print(f"  Sites of interest: {sites_of_interest}")
            
            confirm = input("\nStart prediction? (y/n): ").strip().lower()
            if confirm != 'y':
                print("Prediction canceled")
                return None
            results_df, protein_sequence, target_aa = self.run_prediction(ptm_type, protein_id, pdb_path)

            total, positive = self.display_results(results_df, ptm_type, protein_id, protein_sequence, target_aa, sites_of_interest)

            output_path = self.save_results(results_df, protein_id, ptm_type)
            
            print(f"\n✓ Prediction completed! Analyzed {total} {''.join(target_aa)} positions, predicted {positive} {ptm_type} sites")
            return results_df
            
        except Exception as e:
            print(f"\n✗ Error: {e}")
            return None

print("Creating the prediction system...")
interactive_predictor = MultiPTMPredictor()
print("Prediction system is ready!")

Creating the prediction system...
Prediction system is ready!
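
Two small steps in Cell 2 are easy to get wrong: candidate positions are reported as 1-based indices into the sequence, and the sites-of-interest string is split on commas. The sketch below isolates both as standalone functions so they can be tried without loading the model. The names `find_target_positions` and `parse_sites` are hypothetical, and the range check in `parse_sites` is an addition that the notebook itself does not perform.

```python
def find_target_positions(sequence, target_aa):
    """Return 1-based positions of residues in `sequence` that belong to `target_aa`."""
    targets = set(target_aa)  # accepts a string such as "ST" or a list such as ["S", "T"]
    return [i for i, aa in enumerate(sequence, start=1) if aa in targets]


def parse_sites(text, sequence_length):
    """Parse '129, 308,473' into sorted unique ints.

    Blank tokens are skipped and positions outside 1..sequence_length are
    dropped; non-numeric tokens raise ValueError, as in the notebook.
    """
    sites = set()
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        site = int(token)
        if 1 <= site <= sequence_length:
            sites.add(site)
    return sorted(sites)


# Example usage with a toy sequence (not a real protein):
#     find_target_positions("MSTKS", "ST")
#     parse_sites("129, 308,473", 480)
```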


In [3]:
# Cell 3: Start Prediction
print("PTM Site Prediction System")
print("=" * 50)
print("Welcome to the DeepPTMPred Prediction System!")
print()
print("Steps to use:")
print("1. Make sure the PDB file and ESM feature file are prepared")
print("2. Follow the prompts to enter the protein ID and other information")
print("3. The system will automatically perform the prediction and display the results")
print("4. Results will be automatically saved to a CSV file")
print()

# Start prediction
results = interactive_predictor.start_prediction()

if results is not None:
    print("Prediction completed!")
else:
    print("Prediction not completed, please check your input or files")

PTM Site Prediction System
Welcome to the DeepPTMPred Prediction System!

Steps to use:
1. Make sure the PDB file and ESM feature file are prepared
2. Follow the prompts to enter the protein ID and other information
3. The system will automatically perform the prediction and display the results
4. Results will be automatically saved to a CSV file


DeepPTMPred - Multi-PTM Type Prediction

Please select a PTM type:
 1. Phosphorylation (S, T)
 2. Ubiquitination (K)
 3. Acetylation (K)
 4. Hydroxylation (P)
 5. γ-Carboxyglutamic Acid (E)
 6. Lysine Methylation (K)
 7. Malonylation (K)
 8. Arginine Methylation (R)
 9. Crotonylation (K)
10. Succinylation (K)
11. Glutathionylation (C)
12. SUMOylation (K)
13. S-Nitrosylation (C)
14. Glutarylation (K)
15. Citrullination (R)
16. O-linked Glycosylation (S, T)
17. N-linked Glycosylation (N)

Checking required files...
✓ PDB file: AF-P31749-F1-model_v4.pdb
✓ ESM feature file: P31749_full_esm.npz

Parameters confirmed:
  PTM type: Phosphorylation (

2025-10-22 17:26:03.709283: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


Sequence length: 480
Found 53 ST positions
Running model prediction...

=== PyRosetta Initialization ===
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  PyRosetta-4                                  │
│               Created in JHU by Sergey Lyskov and PyRosetta Team              │
│               (C) Copyright Rosetta Commons Member Institutions               │
│                                                                               │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRES PURCHASE OF A LICENSE │
│          See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└───────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2025 [Rosetta PyRosetta4.Release.python310.ubuntu 2025.42+release.62f6a3258646b8927d40ecaca1860b284bd81c73 2025-10-10T13:25:42] retrieved from: http://www.pyrosetta.org
Successfully initialize PyRosetta

=== PDB File Validation
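
Once a run finishes, the predictions are written to `results/<protein_id>_<ptm_type>_predictions.csv`. The sketch below shows one way to reload such a file and list its highest-scoring sites using only the standard library. It assumes the CSV contains the `position`, `residue`, and `probability` columns that `display_results` reads from the results DataFrame; the function name `top_sites` is hypothetical, and the default threshold of 0.6 simply mirrors the notebook's high-confidence cutoff.

```python
import csv


def top_sites(csv_path, threshold=0.6, limit=8):
    """Return up to `limit` (position, residue, probability) tuples above `threshold`.

    Assumes the CSV has 'position', 'residue', and 'probability' columns,
    as used by display_results in this notebook.
    """
    with open(csv_path, newline="") as handle:
        rows = [
            # int(float(...)) tolerates positions written as "129" or "129.0"
            (int(float(r["position"])), r["residue"], float(r["probability"]))
            for r in csv.DictReader(handle)
        ]
    hits = [row for row in rows if row[2] > threshold]
    # Highest probability first
    return sorted(hits, key=lambda row: row[2], reverse=True)[:limit]


# Example usage (path follows the naming scheme in save_results):
#     for position, residue, probability in top_sites("results/P31749_phosphorylation_predictions.csv"):
#         print(f"{residue}{position}: {probability:.3f}")
```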

In [4]:
# Cell 4: Continue Prediction (Optional)
def continue_prediction():
    """Continue predicting other proteins"""
    while True:
        print("\n" + "="*50)
        continue_pred = input("Do you want to continue predicting other proteins? (y/n): ").strip().lower()
        
        if continue_pred == 'y':
            results = interactive_predictor.start_prediction()
            if results is None:
                print("Prediction failed, please check the issue and try again.")
        else:
            print("Thank you for using DeepPTMPred for prediction!")
            break

# Comment out the line below to disable continuous prediction
continue_prediction()

print("\nTip: To predict other proteins, please rerun Cell 3")



DeepPTMPred - Multi-PTM Type Prediction

Please select a PTM type:
 1. Phosphorylation (S, T)
 2. Ubiquitination (K)
 3. Acetylation (K)
 4. Hydroxylation (P)
 5. γ-Carboxyglutamic Acid (E)
 6. Lysine Methylation (K)
 7. Malonylation (K)
 8. Arginine Methylation (R)
 9. Crotonylation (K)
10. Succinylation (K)
11. Glutathionylation (C)
12. SUMOylation (K)
13. S-Nitrosylation (C)
14. Glutarylation (K)
15. Citrullination (R)
16. O-linked Glycosylation (S, T)
17. N-linked Glycosylation (N)

Checking required files...
✓ PDB file: AF-P31749-F1-model_v4.pdb
✓ ESM feature file: P31749_full_esm.npz

Parameters confirmed:
  PTM type: Ubiquitination (K)
  Protein: P31749
  Sites of interest: [473]

Starting prediction for Ubiquitination (K) ...
Sequence length: 480
Found 36 K positions
Running model prediction...

=== PyRosetta Initialization ===
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  PyRosetta-4                      