🧬 ProtWord: A Discrete Language of Protein Words for Functional Discovery and Design


Welcome to the official repository of ProtWord.

💡 The Idea: Why ProtWord?

Protein Language Models (PLMs) have pioneered the intersection of AI and biology, achieving remarkable breakthroughs by applying Natural Language Processing (NLP) techniques to amino acid sequences. Building upon these foundational works, ProtWord explores a new, physics-aware paradigm.

Amino acids are material entities constrained by local geometry and steric exclusion. Treating them purely as 1D continuous text can be computationally intensive and sometimes obscures these local physical constraints. ProtWord addresses this by introducing a hierarchical, discrete framework:

  1. Convolutional Bottleneck (U-Net style): We compress the sequence 4x, offloading local physical constraints to CNN kernels.
  2. Discrete Vocabulary: A VQ-VAE quantizes the continuous landscape into a learnable codebook of 8,192 "protein words".
  3. Latent GPT: An autoregressive model is trained directly on this compressed, discrete vocabulary to learn the combinatorial grammar of protein architecture.
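The quantization in step 2 is, at its core, a nearest-neighbor lookup: each continuous encoder vector is replaced by the index of its closest codebook entry, turning the sequence into discrete token ids. A minimal sketch of that lookup with toy dimensions (the real codebook holds 8,192 learned entries; names here are illustrative, not the repository's API):

```python
import math
import random

random.seed(0)

CODEBOOK_SIZE = 8   # toy stand-in for the real 8,192-entry codebook
DIM = 4             # toy latent dimension

codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

def quantize(vec):
    """Return the index of the nearest codebook entry (Euclidean distance)."""
    return min(range(CODEBOOK_SIZE), key=lambda i: math.dist(vec, codebook[i]))

# A sequence of continuous latents becomes a sequence of discrete "protein words".
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
codes = [quantize(v) for v in latents]
print(codes)  # five integer token ids in [0, CODEBOOK_SIZE)
```

The Latent GPT in step 3 then models sequences of these integer ids exactly as an NLP language model models word tokens.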

Figure 1A: The ProtWord framework, from hierarchical continuous modeling to discrete evolutionary protein words, and finally to a de novo protein generator.

⚡ Near-Linear Computational Efficiency

By decoupling local steric constraints from global topology, ProtWord (at only 150M parameters) achieves near-linear computational scaling. It avoids the Out-Of-Memory (OOM) bottlenecks that standard continuous-space transformers hit on long sequences, while remaining highly competitive in remote homology detection and mutation effect prediction.

Figure S1: Inference latency and peak memory usage. ProtWord scales far more efficiently with sequence length than standard architectures.
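Most of the savings comes from the 4x compression itself: full self-attention scores a number of query-key pairs that grows quadratically with token count, so a 4x shorter latent sequence cuts attention work roughly 16-fold. A back-of-the-envelope illustration (token counts only, ignoring constant factors and the CNN's own cost):

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs a full self-attention layer scores."""
    return n_tokens * n_tokens

seq_len = 2048              # residues in a long protein sequence
latent_len = seq_len // 4   # after the 4x convolutional compression

full = attention_pairs(seq_len)
compressed = attention_pairs(latent_len)
print(full // compressed)  # → 16
```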

🐁 From in silico to in vivo

We demonstrate the discovery potential of this semantic axis by identifying C7orf57 (ADMAP1) as a previously uncharacterized regulator of sperm motility. This discovery was validated via CRISPR-Cas9 knockout mice, where we observed severe sperm motility impairment and axonemal defects.


📂 Repository Structure

Due to GitHub's file size limits, large matrices, evaluation datasets, and model weights (6 GB+) are hosted on Zenodo.

.
├── data/                  # Empty on GitHub (Download plot data from Zenodo)
├── figure/                # Jupyter Notebooks for reproducing paper figures
├── images/                # Readme assets (Figure_1A.png, Figure_S1_efficiency.png)
└── ProtWord/
    ├── checkpoints/       # Empty on GitHub (Download weights from Zenodo)
    ├── data/              # Empty on GitHub (Evaluation datasets from Zenodo)
    ├── protword/          # Core Neural Network Modules (Encoder, VQ, GPT, Contact)
    └── scripts/           # Ready-to-use inference scripts

🛠️ Installation & Requirements

Ensure you have Anaconda/Miniconda installed.

# 1. Clone the repository
git clone https://github.com/young55775/ProtWord.git
cd ProtWord

# 2. Create conda environment
conda create -n protword python=3.11
conda activate protword

# 3. Install PyTorch (Adjust CUDA version if necessary for your hardware)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install required dependencies
pip install fair-esm biopython seaborn matplotlib tqdm numpy scipy

# 5. Download Weights
# Download `checkpoints.zip` from our Zenodo repository and extract it into `ProtWord/checkpoints/`

🚀 Quick Start / Usage

We provide several out-of-the-box scripts for inference and analysis. Please run all commands from inside the ProtWord/ directory.

cd ProtWord

1. Zero-shot Mutation Effect Prediction (In silico DMS)

Generate a full mutation heatmap for a given sequence (No training required!).

python scripts/mutation_heatmap.py \
    --ckpt ./checkpoints/encoder_t12_150M.pth \
    --seq "MQAIKCVVVGDGAVGKT" \
    --output ./mutation_heatmap.png
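Conceptually, an in silico DMS enumerates every single-point substitution and scores each variant with the model; the heatmap is the resulting 20 x L grid. A stdlib sketch of the enumeration step only (the scoring itself lives in the checkpointed model, and the actual script may organize this differently):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_variants(seq: str):
    """Yield (position, wildtype, mutant, variant_sequence) for every
    single-point substitution -- 19 * len(seq) variants in total."""
    for pos, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield pos, wt, aa, seq[:pos] + aa + seq[pos + 1:]

variants = list(enumerate_variants("MQAIKCVVVGDGAVGKT"))
print(len(variants))  # 17 residues * 19 substitutions = 323
```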

2. Map Evolutionary Conservation to PDB B-factors

Score a 3D structure using ProtWord and output a new PDB file where B-factors reflect mutational sensitivity. (Open the output in PyMOL and run spectrum b, blue_white_red to visualize!)

python scripts/score_structure.py \
    --ckpt ./checkpoints/encoder_t12_150M.pth \
    --pdb ./Rac1a.pdb \
    --output ./Rac1a_score.pdb \
    --score_mode mean_loss
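The trick here relies on the fixed-column PDB format: the B-factor occupies columns 61-66 of each ATOM record (format %6.2f), which is exactly the field PyMOL's spectrum b colors by. A minimal sketch of overwriting that field, assuming one precomputed per-atom score (not the repository's implementation):

```python
def set_bfactor(atom_line: str, score: float) -> str:
    """Overwrite columns 61-66 (0-indexed slice 60:66) of a PDB ATOM or
    HETATM record with a new B-factor, leaving the rest intact."""
    if not atom_line.startswith(("ATOM", "HETATM")):
        return atom_line
    return atom_line[:60] + f"{score:6.2f}" + atom_line[66:]

line = ("ATOM      1  N   MET A   1      11.104  13.207   2.100"
        "  1.00 25.00           N")
print(set_bfactor(line, 0.87))  # same record, B-factor now 0.87
```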

3. Discretize Sequence to VQ Codes

Convert a natural protein sequence into its compressed, discrete "protein words" (latent tokens).

python scripts/seq_to_codes.py \
    --encoder_ckpt ./checkpoints/encoder_t12_150M.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --seq "MSLLSRVRRFKVFVD"

4. Generative Design (Latent GPT)

Let the Latent GPT model dream up new protein sequences based on the 8,192-token discrete codebook.

Option A: Pure Unconditional Generation (Explore the natural protein manifold)

python scripts/gpt_sample.py \
    --gpt_ckpt ./checkpoints/gpt_8192.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --similarity_threshold 0.4 \
    --num_samples 100 \
    --top_k 50 --top_p 0.95 --temperature 1.0

Option B: Family-Specific Generation (e.g., de novo Cofilin variants)

python scripts/gpt_sample.py \
    --gpt_ckpt ./checkpoints/cofilin.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --similarity_threshold 0.6 \
    --num_samples 100 \
    --top_k 50 --top_p 0.95 --temperature 1.0
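The top_k, top_p, and temperature flags in both options control a standard autoregressive sampling loop: logits are temperature-scaled, truncated to the k most probable tokens, then further trimmed to the smallest nucleus whose mass reaches p before drawing. A generic stdlib sketch of one sampling step (illustrative of the knobs, not the repository's sampler):

```python
import math
import random

def sample_token(logits, top_k=50, top_p=0.95, temperature=1.0, rng=random):
    """Draw one token id after temperature scaling, top-k truncation,
    and nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # numerically stable softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the top_k most probable token ids
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # nucleus: keep the smallest prefix whose probability mass reaches top_p
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)
token = sample_token([2.0, 1.0, 0.5, -1.0], top_k=3, top_p=0.9, rng=rng)
```

Lower temperature or top_p concentrates sampling on high-probability words (conservative designs); raising them explores more of the codebook.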

📜 Citation

If you find our model, codebooks, or the in vivo CASA tracking datasets useful, please cite our preprint:

@article{guo2026protword,
  title={A Discrete Language of Protein Words for Functional Discovery and Design},
  author={Guo, Zhengyang and Wang, Zi and Chai, Yongping and Xu, Kaiming and Li, Ming and Li, Wei and Ou, Guangshuo},
  journal={bioRxiv},
  year={2026},
  doi={Pending}
}

⚠️ License & Commercial Use

  • Experimental Data: Released under CC BY 4.0.
  • Model Weights & Codebooks: Governed by the ProtWord Open RAIL-M License (Strict biosecurity use restrictions apply. See checkpoints/LICENSE.txt).
  • Source Code: Released under the ProtWord Academic and Non-Commercial Research License.

Patent Notice: The core methods and architectures implemented in ProtWord are protected by pending patent applications. The source code is freely available for academic and non-commercial research only.

💼 For Commercial Licensing: If you represent a pharmaceutical company, biotech startup, or any for-profit entity wishing to use ProtWord for commercial protein design or target discovery, please contact guozy23@mails.tsinghua.edu.cn to inquire about commercial licensing.
