🧬 ProtWord: A Discrete Language of Protein Words for Functional Discovery and Design


Welcome to the official repository of ProtWord.

💡 The Idea: Why ProtWord?

Protein Language Models (PLMs) have pioneered the intersection of AI and biology, achieving remarkable breakthroughs by applying Natural Language Processing (NLP) techniques to amino acid sequences. Building upon these foundational works, ProtWord explores a new, physics-aware paradigm.

Amino acids are material entities constrained by local geometry and steric exclusion. Treating them purely as 1D continuous text can be computationally intensive and sometimes obscures these local physical constraints. ProtWord addresses this by introducing a hierarchical, discrete framework:

  1. Convolutional Bottleneck (U-Net style): We compress the sequence 4x, offloading local physical constraints to CNN kernels.
  2. Discrete Vocabulary: A VQ-VAE quantizes the continuous landscape into a learnable codebook of 8,192 "protein words".
  3. Latent GPT: An autoregressive model is trained directly on this compressed, discrete vocabulary to learn the combinatorial grammar of protein architecture.
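The quantization in step 2 is, at its core, a nearest-neighbor lookup: each continuous encoder vector is replaced by the index of its closest codebook entry, turning the sequence into discrete token ids. A minimal sketch of that lookup with toy dimensions (the real codebook holds 8,192 learned entries; names here are illustrative, not the repository's API):

```python
import math
import random

random.seed(0)

CODEBOOK_SIZE = 8   # toy stand-in for the real 8,192-entry codebook
DIM = 4             # toy latent dimension

codebook = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

def quantize(vec):
    """Return the index of the nearest codebook entry (Euclidean distance)."""
    return min(range(CODEBOOK_SIZE), key=lambda i: math.dist(vec, codebook[i]))

# A sequence of continuous latents becomes a sequence of discrete "protein words".
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
codes = [quantize(v) for v in latents]
print(codes)  # five integer token ids in [0, CODEBOOK_SIZE)
```

The Latent GPT in step 3 then models sequences of these integer ids exactly as an NLP language model models word tokens.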

Figure 1A: The ProtWord framework, from hierarchical continuous modeling to discrete evolutionary protein words, and finally to a de novo protein generator.

⚡ Near-Linear Computational Efficiency

By decoupling local steric constraints from global topology, ProtWord (at only 150M parameters) achieves near-linear computational scaling. It avoids the Out-Of-Memory (OOM) bottlenecks that standard continuous-space transformers hit on long sequences, while remaining highly competitive in remote homology detection and mutation effect prediction.

Figure S1: Inference latency and peak memory usage. ProtWord scales far more efficiently with sequence length than standard architectures.
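Most of the savings comes from the 4x compression itself: full self-attention scores a number of query-key pairs that grows quadratically with token count, so a 4x shorter latent sequence cuts attention work roughly 16-fold. A back-of-the-envelope illustration (token counts only, ignoring constant factors and the CNN's own cost):

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs a full self-attention layer scores."""
    return n_tokens * n_tokens

seq_len = 2048              # residues in a long protein sequence
latent_len = seq_len // 4   # after the 4x convolutional compression

full = attention_pairs(seq_len)
compressed = attention_pairs(latent_len)
print(full // compressed)  # → 16
```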

🐁 From in silico to in vivo

We demonstrate the discovery potential of this semantic axis by identifying C7orf57 (ADMAP1) as a previously uncharacterized regulator of sperm motility. This discovery was validated via CRISPR-Cas9 knockout mice, where we observed severe sperm motility impairment and axonemal defects.


📂 Repository Structure

Due to GitHub's file size limits, large matrices, evaluation datasets, and model weights (6 GB+) are hosted on Zenodo.

.
├── data/                  # Empty on GitHub (Download plot data from Zenodo)
├── figure/                # Jupyter Notebooks for reproducing paper figures
├── images/                # Readme assets (Figure_1A.png, Figure_S1_efficiency.png)
└── ProtWord/
    ├── checkpoints/       # Empty on GitHub (Download weights from Zenodo)
    ├── data/              # Empty on GitHub (Evaluation datasets from Zenodo)
    ├── protword/          # Core Neural Network Modules (Encoder, VQ, GPT, Contact)
    └── scripts/           # Ready-to-use inference scripts

🛠️ Installation & Requirements

Ensure you have Anaconda/Miniconda installed.

# 1. Clone the repository
git clone https://github.com/young55775/ProtWord.git
cd ProtWord

# 2. Create conda environment
conda create -n protword python=3.11
conda activate protword

# 3. Install PyTorch (Adjust CUDA version if necessary for your hardware)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# 4. Install required dependencies
pip install fair-esm biopython seaborn matplotlib tqdm numpy scipy

# 5. Download Weights
# Download `checkpoints.zip` from our Zenodo repository and extract it into `ProtWord/checkpoints/`

🚀 Quick Start / Usage

We provide several out-of-the-box scripts for inference and analysis. Please run all commands from inside the ProtWord/ directory.

cd ProtWord

1. Zero-shot Mutation Effect Prediction (In silico DMS)

Generate a full mutation heatmap for a given sequence (No training required!).

python scripts/mutation_heatmap.py \
    --ckpt ./checkpoints/encoder_t12_150M.pth \
    --seq "MQAIKCVVVGDGAVGKT" \
    --output ./mutation_heatmap.png
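Conceptually, an in silico DMS enumerates every single-point substitution and scores each variant with the model; the heatmap is the resulting 20 x L grid. A stdlib sketch of the enumeration step only (the scoring itself lives in the checkpointed model, and the actual script may organize this differently):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_variants(seq: str):
    """Yield (position, wildtype, mutant, variant_sequence) for every
    single-point substitution -- 19 * len(seq) variants in total."""
    for pos, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield pos, wt, aa, seq[:pos] + aa + seq[pos + 1:]

variants = list(enumerate_variants("MQAIKCVVVGDGAVGKT"))
print(len(variants))  # 17 residues * 19 substitutions = 323
```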

2. Map Evolutionary Conservation to PDB B-factors

Score a 3D structure using ProtWord and output a new PDB file where B-factors reflect mutational sensitivity. (Open the output in PyMOL and run spectrum b, blue_white_red to visualize!)

python scripts/score_structure.py \
    --ckpt ./checkpoints/encoder_t12_150M.pth \
    --pdb ./Rac1a.pdb \
    --output ./Rac1a_score.pdb \
    --score_mode mean_loss
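The trick here relies on the fixed-column PDB format: the B-factor occupies columns 61-66 of each ATOM record (format %6.2f), which is exactly the field PyMOL's spectrum b colors by. A minimal sketch of overwriting that field, assuming one precomputed per-atom score (not the repository's implementation):

```python
def set_bfactor(atom_line: str, score: float) -> str:
    """Overwrite columns 61-66 (0-indexed slice 60:66) of a PDB ATOM or
    HETATM record with a new B-factor, leaving the rest intact."""
    if not atom_line.startswith(("ATOM", "HETATM")):
        return atom_line
    return atom_line[:60] + f"{score:6.2f}" + atom_line[66:]

line = ("ATOM      1  N   MET A   1      11.104  13.207   2.100"
        "  1.00 25.00           N")
print(set_bfactor(line, 0.87))  # same record, B-factor now 0.87
```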

3. Discretize Sequence to VQ Codes

Convert a natural protein sequence into its compressed, discrete "protein words" (latent tokens).

python scripts/seq_to_codes.py \
    --encoder_ckpt ./checkpoints/encoder_t12_150M.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --seq "MSLLSRVRRFKVFVD"

4. Generative Design (Latent GPT)

Let the Latent GPT model dream up new protein sequences based on the 8,192-token discrete codebook.

Option A: Pure Unconditional Generation (Explore the natural protein manifold)

python scripts/gpt_sample.py \
    --gpt_ckpt ./checkpoints/gpt_8192.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --similarity_threshold 0.4 \
    --num_samples 100 \
    --top_k 50 --top_p 0.95 --temperature 1.0

Option B: Family-Specific Generation (e.g., de novo Cofilin variants)

python scripts/gpt_sample.py \
    --gpt_ckpt ./checkpoints/cofilin.pth \
    --vq_ckpt ./checkpoints/vqvae_8192.pth \
    --similarity_threshold 0.6 \
    --num_samples 100 \
    --top_k 50 --top_p 0.95 --temperature 1.0
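The top_k, top_p, and temperature flags in both options control a standard autoregressive sampling loop: logits are temperature-scaled, truncated to the k most probable tokens, then further trimmed to the smallest nucleus whose mass reaches p before drawing. A generic stdlib sketch of one sampling step (illustrative of the knobs, not the repository's sampler):

```python
import math
import random

def sample_token(logits, top_k=50, top_p=0.95, temperature=1.0, rng=random):
    """Draw one token id after temperature scaling, top-k truncation,
    and nucleus (top-p) filtering."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                  # numerically stable softmax
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # keep the top_k most probable token ids
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # nucleus: keep the smallest prefix whose probability mass reaches top_p
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

rng = random.Random(0)
token = sample_token([2.0, 1.0, 0.5, -1.0], top_k=3, top_p=0.9, rng=rng)
```

Lower temperature or top_p concentrates sampling on high-probability words (conservative designs); raising them explores more of the codebook.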

📜 Citation

If you find our model, codebooks, or the in vivo CASA tracking datasets useful, please cite our preprint:

@article{guo2026protword,
  title={A Discrete Language of Protein Words for Functional Discovery and Design},
  author={Guo, Zhengyang and Wang, Zi and Chai, Yongping and Xu, Kaiming and Li, Ming and Li, Wei and Ou, Guangshuo},
  journal={bioRxiv},
  year={2026},
  doi={Pending}
}

⚠️ License & Commercial Use

  • Experimental Data: Released under CC BY 4.0.
  • Model Weights & Codebooks: Governed by the ProtWord Open RAIL-M License (Strict biosecurity use restrictions apply. See checkpoints/LICENSE.txt).
  • Source Code: Released under the ProtWord Academic and Non-Commercial Research License.

Patent Notice: The core methods and architectures implemented in ProtWord are protected by pending patent applications. The source code is freely available for academic and non-commercial research only.

💼 For Commercial Licensing: If you represent a pharmaceutical company, biotech startup, or any for-profit entity wishing to use ProtWord for commercial protein design or target discovery, please contact guozy23@mails.tsinghua.edu.cn to inquire about commercial licensing.
