<span style="font-size:18px"># DeepPTMPred: Multimodal Prediction of Protein PTM Sites </span>

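> This notebook walks through a complete pipeline for predicting protein post-translational modification (PTM) sites, combining sequence information, ESM language-model features, and protein structural features.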


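The system fuses the following multimodal features:

- **Sequence information**: multi-scale sequence windows capture local patterns  
- **ESM language-model features**: semantic representations from the pretrained ESM-2 model  
- **Structural features**: geometric descriptors such as SASA, dihedral angles, and secondary structure, computed with PyRosetta  
- **Deep learning architecture**: a hybrid model combining CNN, Transformer, and attention mechanisms  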

---

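### Core model components

1. **Multi-scale convolutional network**: three window sizes, `[21, 33, 51]`, capture context at different ranges  
2. **Transformer encoder**: models long-range dependencies  
3. **Structural feature fusion**: aligns 3D geometric features with sequence features and fuses them  
4. **ESM feature enhancement**: adds ESM-2 embeddings to strengthen the semantic representation  
5. **Attention mechanism**: learns importance weights for each feature automatically  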
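
To make the multi-scale idea concrete, here is a minimal sketch of how windows of 21, 33, and 51 residues might be cut around one candidate site. The function name `extract_windows` and the `X` padding character are assumptions made for illustration; the actual preprocessing in DeepPTMPred is not shown in this notebook and may differ.

```python
def extract_windows(sequence, position, window_sizes=(21, 33, 51), pad_char="X"):
    """Return one centred window per size around a 1-based residue position.

    Positions that fall outside the sequence are filled with pad_char so that
    every window has exactly the requested length.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    center = position - 1  # convert to a 0-based index
    windows = {}
    for size in window_sizes:
        half = size // 2
        chars = []
        for i in range(center - half, center + half + 1):
            chars.append(sequence[i] if 0 <= i < len(sequence) else pad_char)
        windows[size] = "".join(chars)
    return windows


if __name__ == "__main__":
    # An arbitrary demo string, not a real protein sequence.
    for size, window in extract_windows("ACDEFGHIKLMNPQRSTVWY", 5).items():
        print(size, window)
```

Each window keeps the candidate residue at its centre, so the three sizes describe the same site at increasing ranges of context.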

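> 💡 The system takes a PDB file as input, automatically extracts sequence, structural, and ESM features, and outputs a per-site modification probability (phosphorylation by default).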

## 🧰 Environment setup and dependency installation

Follow the steps below to configure the runtime environment:

conda create -n ptm-env python=3.10 -y
conda activate ptm-env
conda install -c nvidia cudatoolkit=11.8 cudnn
conda install -c pytorch pytorch torchvision torchaudio
pip install tensorflow==2.15
pip install tensorflow-addons
pip install fair-esm
pip install scikit-learn==1.6.1
pip install imbalanced-learn==0.13.0
pip install matplotlib==3.10.3
pip install seaborn==0.13.2
pip install tqdm==4.67.1
pip install joblib==1.4.2
pip install logomaker==0.8.7
Install PyRosetta:
pip install pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()'
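
After running the commands above, a quick check like the following can confirm that each package is visible to the active interpreter. This is only a sketch: the helper name `find_missing_modules` is hypothetical, and the list uses import names (for example `esm` for `fair-esm`, `sklearn` for `scikit-learn`, `imblearn` for `imbalanced-learn`) rather than pip names.

```python
import importlib.util

# Import names (not pip package names) of the dependencies installed above.
REQUIRED_MODULES = [
    "torch", "tensorflow", "tensorflow_addons", "esm", "sklearn", "imblearn",
    "matplotlib", "seaborn", "tqdm", "joblib", "logomaker", "pyrosetta",
]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in module_names:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

The check only confirms that a module can be found; it does not verify versions or GPU support.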

After installing PyRosetta, make sure your project directory is laid out as follows:

/DeepPTMPred/
├── data/                          # Raw data directory
│   ├── AF-P31749-F1-model_v4.pdb  # Example PDB file
│   └── [other PDB files]
├── pred/
│   ├── train_PTM/                 # Training-related files
│   │   ├── model/
│   │   │   └── models_phosphorylation_esm2/
│   │   │       └── ptm_data_210_39_64_best_model.h5  # Trained model weights
│   │   ├── predict.ipynb  # This notebook
│   │   └── [other training scripts and logs]
│   └── custom_esm/                # ESM feature files
│       ├── P31749_full_esm.npz    # Case 1: ESM features
│       └── [ESM features for other proteins]
├── results/                       # Output directory for predictions
│     └── [prediction result CSV files]

> ✅ Once the steps above are complete, you can run this notebook directly to predict PTM sites.
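
Before running the cells, it can help to confirm that the three files this notebook reads are where the layout above says they should be. The sketch below rebuilds those paths from a project root, a protein ID, and a PTM type; the function names `expected_files` and `missing_files` are hypothetical, and the file-name patterns are copied from the directory tree above.

```python
from pathlib import Path


def expected_files(project_root, protein_id, ptm_type="phosphorylation"):
    """Map a short label to each file the notebook expects for one protein."""
    root = Path(project_root)
    return {
        "pdb": root / "data" / f"AF-{protein_id}-F1-model_v4.pdb",
        "esm": root / "pred" / "custom_esm" / f"{protein_id}_full_esm.npz",
        "model": (root / "pred" / "train_PTM" / "model"
                  / f"models_{ptm_type}_esm2"
                  / "ptm_data_210_39_64_best_model.h5"),
    }


def missing_files(project_root, protein_id, ptm_type="phosphorylation"):
    """Return the labels of expected files that do not exist on disk."""
    files = expected_files(project_root, protein_id, ptm_type)
    return [label for label, path in files.items() if not path.exists()]


if __name__ == "__main__":
    # The root used in this notebook; adjust it to your own installation.
    print(missing_files("/root/autodl-tmp/DeepPTMPred", "P31749"))
```

An empty list means all three files were found.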

In [1]:
# Cell 1: Fix paths and initialize the system
print("Initializing the DeepPTMPred prediction system...")
import os
import sys
import pandas as pd
import tensorflow as tf
from tensorflow.keras import backend as K
sys.path.append('/root/autodl-tmp/DeepPTMPred/pred/train_PTM')

# Import the full prediction module
try:
    from predict import (
        PredictConfig,
        PTMPredictor,
        extract_protein_id_from_pdb_path,
        extract_sequence_from_pdb
    )
    print("Prediction module loaded successfully")
except ImportError as e:
    print(f"Failed to load: {str(e)}")
    print("Make sure the path to predict.py is correct")
    raise  # the class below cannot be defined without PredictConfig

class FixedPredictConfig(PredictConfig):
    def __init__(self, ptm_type="phosphorylation"):
        # Initialize the parent class for the selected PTM type
        super().__init__(ptm_type=ptm_type)
        # Use an absolute path for the model file
        self.model_path = f"/root/autodl-tmp/DeepPTMPred/pred/train_PTM/model/models_{ptm_type}_esm2/ptm_data_210_39_64_best_model.h5"
        # Check that the path is valid
        print(f"Model file path: {self.model_path}")
        print(f"File exists: {os.path.exists(self.model_path)}")
        if not os.path.exists(self.model_path):
            print("Model file not found! Please check the path")

print("System initialization complete!")

Initializing the DeepPTMPred prediction system...


2025-10-19 13:36:47.413106: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-10-19 13:36:47.413138: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-10-19 13:36:47.414210: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-10-19 13:36:47.419939: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

TensorFlow Addons (TFA) has ended development and in

Prediction module loaded successfully
System initialization complete!
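
Cell 1 imports `extract_sequence_from_pdb` from `predict.py`, whose source is not shown in this notebook. As a rough illustration of what such a helper has to do, the sketch below reads the fixed-column `ATOM` records of one chain and converts three-letter residue names to one-letter codes. The name `sequence_from_pdb_lines` is hypothetical, and the real implementation may well rely on PyRosetta or another parser instead.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequence_from_pdb_lines(lines, chain_id="A"):
    """Build a one-letter sequence from the ATOM records of one chain."""
    residues = []
    seen = set()
    for line in lines:
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[21] != chain_id:  # column 22 holds the chain identifier
            continue
        # Columns 23-26 hold the residue number, column 27 the insertion code.
        key = (line[22:26], line[26])
        if key in seen:
            continue  # further atoms of a residue that was already recorded
        seen.add(key)
        # Columns 18-20 hold the residue name; unknown names become "X".
        residues.append(THREE_TO_ONE.get(line[17:20].strip(), "X"))
    return "".join(residues)


def read_chain_sequence(pdb_path, chain_id="A"):
    """Convenience wrapper that reads the records from a PDB file on disk."""
    with open(pdb_path, encoding="utf-8") as handle:
        return sequence_from_pdb_lines(handle, chain_id)
```

Residue order follows the order of records in the file, which matches sequence order for a single-chain AlphaFold model such as the example used here.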


In [2]:
# Cell 2: Interactive prediction system
class MultiPTMPredictor:
    def __init__(self):
        self.ptm_types = [
            'phosphorylation', 'ubiquitination','acetylation', 'hydroxylation',
            'gamma_carboxyglutamic_acid', 'lys_methylation', 'malonylation', 
            'arg_methylation', 'crotonylation', 'succinylation', 'glutathionylation',
            'sumoylation', 's_nitrosylation', 'glutarylation', 'citrullination',
            'o_linked_glycosylation', 'n_linked_glycosylation'
        ]
        
        self.ptm_descriptions = {
            'phosphorylation': 'Phosphorylation (S, T)',
            'ubiquitination': 'Ubiquitination (K)',
            'acetylation': 'Acetylation (K)',
            'hydroxylation': 'Hydroxylation (P)',
            'gamma_carboxyglutamic_acid': 'γ-Carboxyglutamic acid (E)',
            'lys_methylation': 'Lysine methylation (K)',
            'malonylation': 'Malonylation (K)',
            'arg_methylation': 'Arginine methylation (R)',
            'crotonylation': 'Crotonylation (K)',
            'succinylation': 'Succinylation (K)',
            'glutathionylation': 'Glutathionylation (C)',
            'sumoylation': 'SUMOylation (K)',
            's_nitrosylation': 'S-nitrosylation (C)',
            'glutarylation': 'Glutarylation (K)',
            'citrullination': 'Citrullination (R)',
            'o_linked_glycosylation': 'O-linked glycosylation (S, T)',
            'n_linked_glycosylation': 'N-linked glycosylation (N)'
        }
    
    def get_user_input(self):
        """Collect the PTM type, protein ID, and sites of interest from the user."""
        print("\n" + "="*50)
        print("DeepPTMPred - Multi-PTM Type Prediction")
        print("="*50)

        # Show the available PTM types
        print("\nSelect a PTM type:")
        for i, ptm_type in enumerate(self.ptm_types, 1):
            print(f"{i:2d}. {self.ptm_descriptions[ptm_type]}")

        # Read the PTM type choice
        while True:
            try:
                choice = input(f"\nSelect a PTM type (1-{len(self.ptm_types)}, default 1): ").strip()
                if not choice:
                    choice = 1
                else:
                    choice = int(choice)

                if 1 <= choice <= len(self.ptm_types):
                    ptm_type = self.ptm_types[choice-1]
                    break
                else:
                    print(f"Please enter a number between 1 and {len(self.ptm_types)}")
            except ValueError:
                print("Please enter a valid number")

        # Read the protein ID
        protein_id = input("\nEnter a protein ID (e.g. P31749): ").strip()
        while not protein_id:
            protein_id = input("The protein ID cannot be empty, please try again: ").strip()

        # Read the sites of special interest
        sites_input = input("Sites of interest (comma-separated, e.g. 129,308,473; press Enter to skip): ").strip()
        sites_of_interest = []
        if sites_input:
            try:
                sites_of_interest = [int(x.strip()) for x in sites_input.split(',')]
            except ValueError:
                print("Invalid site format; sites of interest will be skipped")

        return ptm_type, protein_id, sites_of_interest
    
    def validate_files(self, protein_id, ptm_type):
        """Check that the required input files exist."""
        pdb_path = f"/root/autodl-tmp/DeepPTMPred/data/AF-{protein_id}-F1-model_v4.pdb"
        esm_path = f"/root/autodl-tmp/DeepPTMPred/pred/custom_esm/{protein_id}_full_esm.npz"

        print("\nChecking required files...")
        for path, name in [(pdb_path, "PDB file"), (esm_path, "ESM feature file")]:
            if not os.path.exists(path):
                raise FileNotFoundError(f"{name} not found: {path}")
            print(f"✓ {name}: {os.path.basename(path)}")

        return pdb_path
    
    def run_prediction(self, ptm_type, protein_id, pdb_path):
        """Run the prediction for one protein and one PTM type."""
        print(f"\nStarting {self.ptm_descriptions[ptm_type]} prediction...")

        # Create the configuration and the predictor
        config = PredictConfig(ptm_type=ptm_type)
        predictor = PTMPredictor(config)

        # Extract the sequence
        protein_sequence = extract_sequence_from_pdb(pdb_path, chain_id="A")
        print(f"Sequence length: {len(protein_sequence)}")

        # Locate the target amino acid sites (1-based positions)
        target_aa = config.target_aa
        target_positions = [i+1 for i, aa in enumerate(protein_sequence) if aa in target_aa]
        print(f"Found {len(target_positions)} {''.join(target_aa)} sites")

        # Run the prediction
        print("Running model prediction...")
        results_df = predictor.predict_ptm_sites(
            protein_id, protein_sequence, target_positions, pdb_path=pdb_path
        )

        print("Prediction finished!")
        return results_df, protein_sequence, target_aa
    
    def display_results(self, results_df, ptm_type, protein_id, protein_sequence, target_aa, sites_of_interest):
        """Print a summary of the prediction results."""
        print("\n" + "="*50)
        print(f"{protein_id} - {self.ptm_descriptions[ptm_type]} prediction results")
        print("="*50)

        total = len(results_df)
        positive = len(results_df[results_df['prediction'] == 1])
        high_conf = len(results_df[results_df['probability'] > 0.6])
        # Guard against proteins that contain no target residues
        positive_pct = positive / total * 100 if total else 0.0

        print(f"Target amino acids: {target_aa}")
        print(f"Total {''.join(target_aa)} sites: {total}")
        print(f"Predicted {ptm_type} sites: {positive} ({positive_pct:.1f}%)")
        print(f"High confidence (>0.6): {high_conf}")
        if total:
            print(f"Highest probability: {results_df['probability'].max():.3f}")

        # High-probability sites (top 8)
        if high_conf > 0:
            print("\nHigh-confidence sites:")
            high_sites = results_df[results_df['probability'] > 0.6].nlargest(8, 'probability')
            for _, row in high_sites.iterrows():
                print(f"  Position {int(row['position']):3d} ({row['residue']}): {row['probability']:.3f}")

        # Sites of special interest
        if sites_of_interest:
            print("\nSites of interest:")
            for pos in sites_of_interest:
                site_data = results_df[results_df['position'] == pos]
                if not site_data.empty:
                    prob = site_data['probability'].values[0]
                    pred = "yes" if site_data['prediction'].values[0] == 1 else "no"
                    print(f"  Position {pos:3d} ({protein_sequence[pos-1]}): probability={prob:.3f}, predicted={pred}")
                else:
                    print(f"  Position {pos:3d}: not among the target residues ({''.join(target_aa)})")

        return total, positive
    
    def save_results(self, results_df, protein_id, ptm_type):
        """Save the results to a CSV file."""
        output_dir = "/root/autodl-tmp/DeepPTMPred/results"
        os.makedirs(output_dir, exist_ok=True)

        output_path = f"{output_dir}/{protein_id}_{ptm_type}_predictions.csv"
        results_df.to_csv(output_path, index=False)
        print(f"\nResults saved to: {output_path}")
        return output_path
    
    def start_prediction(self):
        """Run the full interactive prediction workflow."""
        try:
            ptm_type, protein_id, sites_of_interest = self.get_user_input()
            pdb_path = self.validate_files(protein_id, ptm_type)
            print("\nConfirm parameters:")
            print(f"  PTM type: {self.ptm_descriptions[ptm_type]}")
            print(f"  Protein: {protein_id}")
            if sites_of_interest:
                print(f"  Sites of interest: {sites_of_interest}")

            confirm = input("\nStart prediction? (y/n): ").strip().lower()
            if confirm != 'y':
                print("Prediction cancelled")
                return None
            results_df, protein_sequence, target_aa = self.run_prediction(ptm_type, protein_id, pdb_path)

            total, positive = self.display_results(results_df, ptm_type, protein_id, protein_sequence, target_aa, sites_of_interest)

            output_path = self.save_results(results_df, protein_id, ptm_type)

            print(f"\n✓ Prediction complete! Analyzed {total} {''.join(target_aa)} sites; {positive} predicted as {ptm_type} sites")
            return results_df

        except Exception as e:
            print(f"\n✗ Error: {str(e)}")
            return None

print("Creating the interactive prediction system...")
interactive_predictor = MultiPTMPredictor()
print("Interactive prediction system ready!")

Creating the interactive prediction system...
Interactive prediction system ready!
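
The site parsing in `get_user_input` discards every site if a single entry is malformed. A more forgiving variant could keep the valid entries and report the rest, as sketched below. The function `parse_sites` and its optional `sequence_length` bound are assumptions for illustration and are not part of DeepPTMPred.

```python
def parse_sites(text, sequence_length=None):
    """Split a comma-separated string into valid 1-based sites and rejects.

    Returns (valid, rejected): valid holds unique integers in input order,
    rejected holds the raw tokens that were not usable positions.
    """
    valid, rejected = [], []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # ignore empty entries such as a trailing comma
        try:
            site = int(token)
        except ValueError:
            rejected.append(token)
            continue
        in_range = site >= 1 and (sequence_length is None or site <= sequence_length)
        if not in_range:
            rejected.append(token)
        elif site not in valid:
            valid.append(site)
    return valid, rejected


if __name__ == "__main__":
    sites, bad = parse_sites("129, 308,abc,473", sequence_length=480)
    print("valid:", sites, "rejected:", bad)
```

Passing the sequence length lets out-of-range positions be caught before the results table is queried.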


In [3]:
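# Cell 3: Start a prediction
print("PTM Site Prediction System")
print("=" * 50)
print("Welcome to the DeepPTMPred prediction system!")
print()
print("How to use:")
print("1. Make sure the PDB file and ESM feature file are in place")
print("2. Enter the protein ID and other details when prompted")
print("3. The system runs the prediction and displays the results")
print("4. Results are saved to a CSV file automatically")
print()

# Launch the prediction
results = interactive_predictor.start_prediction()

if results is not None:
    print("Prediction complete!")
else:
    print("Prediction did not complete; please check your input or files")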

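PTM Site Prediction System
Welcome to the DeepPTMPred prediction system!

How to use:
1. Make sure the PDB file and ESM feature file are in place
2. Enter the protein ID and other details when prompted
3. The system runs the prediction and displays the results
4. Results are saved to a CSV file automatically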


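DeepPTMPred - Multi-PTM Type Prediction

Select a PTM type:
 1. Phosphorylation (S, T)
 2. Ubiquitination (K)
 3. Acetylation (K)
 4. Hydroxylation (P)
 5. γ-Carboxyglutamic acid (E)
 6. Lysine methylation (K)
 7. Malonylation (K)
 8. Arginine methylation (R)
 9. Crotonylation (K)
10. Succinylation (K)
11. Glutathionylation (C)
12. SUMOylation (K)
13. S-nitrosylation (C)
14. Glutarylation (K)
15. Citrullination (R)
16. O-linked glycosylation (S, T)
17. N-linked glycosylation (N)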



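Select a PTM type (1-17, default 1):  1

Enter a protein ID (e.g. P31749):  P31749
Sites of interest (comma-separated, e.g. 129,308,473; press Enter to skip):  308,473



Checking required files...
✓ PDB file: AF-P31749-F1-model_v4.pdb
✓ ESM feature file: P31749_full_esm.npz

Confirm parameters:
  PTM type: Phosphorylation (S, T)
  Protein: P31749
  Sites of interest: [308, 473]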



Start prediction? (y/n):  Y



Starting Phosphorylation (S, T) prediction...
Sequence length: 480
Found 53 ST sites
Running model prediction...

=== PyRosetta initialization debug ===
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  PyRosetta-4                                  │
│               Created in JHU by Sergey Lyskov and PyRosetta Team              │
│               (C) Copyright Rosetta Commons Member Institutions               │
│                                                                               │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRES PURCHASE OF A LICENSE │
│          See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└───────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2025 [Rosetta PyRosetta4.Release.python310.ubuntu 2025.25+release.a0cefad01b3959ae8327a8931f5ad8c3fad27ea9 2025-06-18T10:51:52] retrieved from: http://www.pyrosetta.org
PyRosetta initialized successfully

=== PDB file validation ===
File path: /root/autodl-tmp/DeepPTMPred/data/AF-P31749-F1-m
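
Cell 3 relies on typed input, which is awkward when the same prediction has to be scripted. The sketch below chains the methods that `MultiPTMPredictor` defines in cell 2 (`validate_files`, `run_prediction`, `display_results`, `save_results`) without any prompts. The wrapper name `run_without_prompts` is hypothetical; it only assumes that the object passed in exposes those four methods and a `ptm_types` list.

```python
def run_without_prompts(predictor, ptm_type, protein_id, sites_of_interest=()):
    """Run validate -> predict -> display -> save and return the CSV path."""
    if ptm_type not in predictor.ptm_types:
        raise ValueError(f"unknown PTM type: {ptm_type}")
    # Raises FileNotFoundError if the PDB or ESM file is missing.
    pdb_path = predictor.validate_files(protein_id, ptm_type)
    results_df, sequence, target_aa = predictor.run_prediction(
        ptm_type, protein_id, pdb_path
    )
    predictor.display_results(
        results_df, ptm_type, protein_id, sequence, target_aa,
        list(sites_of_interest),
    )
    return predictor.save_results(results_df, protein_id, ptm_type)
```

With the objects from this notebook, a call might look like `run_without_prompts(interactive_predictor, "phosphorylation", "P31749", [308, 473])`. Unlike `start_prediction`, this sketch lets exceptions propagate so that a calling script can decide how to handle them.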

In [None]:
# Cell 4: Continue predicting (optional)
def continue_prediction():
    """Keep predicting other proteins until the user declines."""
    while True:
        print("\n" + "="*50)
        continue_pred = input("Predict another protein? (y/n): ").strip().lower()

        if continue_pred == 'y':
            results = interactive_predictor.start_prediction()
            if results is None:
                print("Prediction failed; resolve the issue and try again")
        else:
            print("Thank you for using DeepPTMPred!")
            break

# Comment out the line below to disable continuous prediction
continue_prediction()

print("\nTip: to predict another protein, re-run cell 3")




Predict another protein? (y/n):  Y



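DeepPTMPred - Multi-PTM Type Prediction

Select a PTM type:
 1. Phosphorylation (S, T)
 2. Ubiquitination (K)
 3. Acetylation (K)
 4. Hydroxylation (P)
 5. γ-Carboxyglutamic acid (E)
 6. Lysine methylation (K)
 7. Malonylation (K)
 8. Arginine methylation (R)
 9. Crotonylation (K)
10. Succinylation (K)
11. Glutathionylation (C)
12. SUMOylation (K)
13. S-nitrosylation (C)
14. Glutarylation (K)
15. Citrullination (R)
16. O-linked glycosylation (S, T)
17. N-linked glycosylation (N)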



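Select a PTM type (1-17, default 1):  2

Enter a protein ID (e.g. P31749):  P31749
Sites of interest (comma-separated, e.g. 129,308,473; press Enter to skip):  



Checking required files...
✓ PDB file: AF-P31749-F1-model_v4.pdb
✓ ESM feature file: P31749_full_esm.npz

Confirm parameters:
  PTM type: Ubiquitination (K)
  Protein: P31749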



Start prediction? (y/n):  Y



Starting Ubiquitination (K) prediction...
Sequence length: 480
Found 36 K sites
Running model prediction...

=== PyRosetta initialization debug ===
┌───────────────────────────────────────────────────────────────────────────────┐
│                                  PyRosetta-4                                  │
│               Created in JHU by Sergey Lyskov and PyRosetta Team              │
│               (C) Copyright Rosetta Commons Member Institutions               │
│                                                                               │
│ NOTE: USE OF PyRosetta FOR COMMERCIAL PURPOSES REQUIRES PURCHASE OF A LICENSE │
│          See LICENSE.PyRosetta.md or email license@uw.edu for details         │
└───────────────────────────────────────────────────────────────────────────────┘
PyRosetta-4 2025 [Rosetta PyRosetta4.Release.python310.ubuntu 2025.25+release.a0cefad01b3959ae8327a8931f5ad8c3fad27ea9 2025-06-18T10:51:52] retrieved from: http://www.pyrosetta.org
PyRosetta initialized successfully

=== PDB file validation ===
File path: /root/autodl-tmp/DeepPTMPred/data/AF-P31749-F1-model
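
Each run writes a CSV through `save_results`, so the predictions can be post-processed without re-running the model. The sketch below reads such a file with the standard library and keeps the sites above a probability threshold. The column names `position`, `residue`, and `probability` are the ones `display_results` reads; the function name `load_high_confidence` is hypothetical, and the 0.6 default simply mirrors the cut-off used in cell 2.

```python
import csv


def load_high_confidence(csv_path, threshold=0.6):
    """Return (position, residue, probability) tuples above threshold, best first."""
    hits = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            probability = float(row["probability"])
            if probability > threshold:
                # int(float(...)) tolerates positions written as "308.0".
                position = int(float(row["position"]))
                hits.append((position, row["residue"], probability))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits
```

For the first run above, the input would be the file that `save_results` reports, named after the pattern `{protein_id}_{ptm_type}_predictions.csv` in the `results` directory.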