<a href="https://colab.research.google.com/github/wwl5600/wwl5600/blob/main/LigandMPNN_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **LigandMPNN - Colab**

Atomic context-conditioned protein sequence design using LigandMPNN - [Paper](https://www.biorxiv.org/content/10.1101/2023.12.22.573103v1)

This Colab notebook provides inference code for [LigandMPNN](https://github.com/dauparas/LigandMPNN) & [ProteinMPNN](https://github.com/dauparas/ProteinMPNN) models. The code and model parameters are available under the MIT license.

This Colab notebook covers all 33 examples from the original LigandMPNN repository by Justas Dauparas; see [**Examples**](https://github.com/dauparas/LigandMPNN#:~:text=0.20A%20Gaussian%20noise-,Examples,-1%20default)



# **1. How to run the examples from the GitHub LigandMPNN repo**

You can run all 33 examples using the original input PDBs once the LigandMPNN models and dependencies are installed. The input PDB files will be available after installation.

*   Go to any example cell and run it with the play button.
*   Note the folder name given to --out_folder in that cell.
*   In the file browser on the left side of this notebook, open the LigandMPNN folder, then the outputs folder, find that folder name, right-click it, and download it.
*   You can change the name in --out_folder "./outputs/default" to anything you like, but it is recommended to leave the inputs and outputs at their original values.

# **2. How to run LigandMPNN using your own PDB files as input**

First run the **Get LigandMPNN models and install dependencies** step.

The steps below assume you have a PDB file named 7KR0.pdb.

1.   Upload **7KR0.pdb** to the inputs folder of this notebook.
2.   To do so, browse from the left side of this notebook to the LigandMPNN folder, then the inputs folder, upload the file, and click refresh to check that it is there.

For example, to run **Example 1 default** on this file:

3.   In the cell, find --pdb_path "./inputs/**1BC8.pdb**" and replace 1BC8.pdb with the **7KR0.pdb** that you uploaded.
4.   In --out_folder "./outputs/**default**", replace the default folder name with anything you like, e.g. "./outputs/**results**".
5.   Run the cell using the play button.
6.   To download the results, browse from the left side of this notebook to the LigandMPNN folder, then the outputs folder, find **results**, and download it.
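
The path edits in steps 3 and 4 can also be sketched in Python. This is a minimal illustration, not part of the notebook: `build_run_command` is a hypothetical helper that only formats arguments, and the file names are the ones assumed in the steps above.

```python
import shlex

def build_run_command(pdb_path, out_folder, **options):
    """Assemble a run.py command line as a list of arguments.

    Hypothetical helper for illustration; it only formats flags and
    does not check that they are valid run.py options.
    """
    args = ["python", "run.py", "--pdb_path", pdb_path, "--out_folder", out_folder]
    for flag, value in options.items():
        args += [f"--{flag}", str(value)]
    return args

# Same settings as Example 1, with the uploaded file and a custom output folder.
cmd = build_run_command("./inputs/7KR0.pdb", "./outputs/results", seed=111)
print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from inside the LigandMPNN folder, which should be equivalent to editing and running the cell by hand.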

# **Here's a simple** [Tutorial](https://youtu.be/LFsxLVqPQho?feature=shared), also available on my [X profile](https://x.com/CryoKhan/status/1750595099464233416?s=20)

# **Important!**
**It is very important to adjust the python script to your needs in whichever example you are interested in.**

To get the most out of the script, first run the last cell and read the usage description it prints. There are many options, from changing models to fixing residues to choosing which chains to design.

**This is the script where you can change anything according to your needs**

!python run.py


**For example:**
The output of the last cell lists the script's usage options, and you can add as many of them as you need. For instance, you can change the model by adding an option like:

!python run.py \
        **--model_type "ligand_mpnn"** \

There are many more options; please check the usage output of the last cell.


# **What works!**
- All the original [**Examples**](https://github.com/dauparas/LigandMPNN#:~:text=0.20A%20Gaussian%20noise-,Examples,-1%20default) from the GitHub LigandMPNN repo.
- Works on any PDB file; upload it and set the python script according to your needs.

This Colab is just for demonstration purposes. I recommend using the original installation of LigandMPNN.

If anyone is interested in improving, downloading, or changing this notebook, please go ahead; it is all yours.

Connect with me: [LinkedIn](https://www.linkedin.com/in/samee-ullah-structural-biologist/)

**Get LigandMPNN models and install dependencies, wait for 2-5 minutes to finish**

Please don't worry about the warnings

In [None]:
!git clone https://github.com/ullahsamee/LigandMPNN.git
%cd LigandMPNN
!bash get_model_params.sh "./model_params"

# Install Miniconda
!wget https://repo.anaconda.com/miniconda/Miniconda3-py311_23.11.0-2-Linux-x86_64.sh
!chmod +x Miniconda3-py311_23.11.0-2-Linux-x86_64.sh
!bash ./Miniconda3-py311_23.11.0-2-Linux-x86_64.sh -b -f -p /usr/local

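# Make the Miniconda site-packages importable from this notebook's Python
import sys
sys.path.append('/usr/local/lib/python3.11/site-packages/')

# Install packages
# Note: each "!" line runs in its own subshell, so the two activate commands
# below do not carry over to later lines; the pip installs go to whichever
# pip is first on PATH (likely the Miniconda base install under /usr/local).
!conda create -n ligandmpnn_env python=3.11 -y
!source activate ligandmpnn_env
!conda activate ligandmpnn_env
!pip install torch
!pip install prody==2.4.1
!pip install pyparsing==3.1.1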

**Examples:**

**1 default**

Default settings will run ProteinMPNN.

In [None]:
%cd /content/LigandMPNN
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/default"

**2 --temperature**

--temperature 0.05 Change sampling temperature (higher temperature gives more sequence diversity).


In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --temperature 0.05 \
        --out_folder "./outputs/temperature"
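
To see why a higher temperature gives more diverse sequences, here is a minimal sketch assuming sampling follows a temperature-scaled softmax, as in the formula quoted in the comments of example 13. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper, not a function from run.py.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate amino acids (made-up numbers).
logits = [2.0, 1.0, 0.5]
for t in (0.05, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all probability mass sits on the top-scoring amino acid, so repeated samples look alike; at higher temperature the distribution flattens and samples vary more.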

**3 --seed**

--seed Not selecting a seed will run with a random seed. Running this multiple times will give different results.

In [None]:
!python run.py \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/random_seed"

**4 --verbose**

--verbose 0 Do not print any statements.



In [None]:
!python run.py \
        --seed 111 \
        --verbose 0 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/verbose"

**5 --save_stats**

--save_stats 1 Save sequence design statistics.

In [None]:
#['generated_sequences', 'sampling_probs', 'log_probs', 'decoding_order', 'native_sequence', 'mask', 'chain_mask', 'seed', 'temperature']
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/save_stats" \
        --save_stats 1

**6 --fixed_residues**

--fixed_residues Fixing specific amino acids. This example fixes the first 10 residues in chain C and adds a global bias towards A (alanine). The output should be all alanines, except that the first 10 residues should match the input sequence because they are fixed.

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/fix_residues" \
        --fixed_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10" \
        --bias_AA "A:10.0"

**7 --redesigned_residues**

--redesigned_residues Specifying which residues need to be designed. This example redesigns the first 10 residues while fixing everything else.



In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/redesign_residues" \
        --redesigned_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10" \
        --bias_AA "A:10.0"
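
Residue lists like the ones passed to --fixed_residues and --redesigned_residues get tedious to type for longer stretches. A small sketch of generating them, where `residue_range` is a hypothetical helper introduced here for illustration:

```python
def residue_range(chain, start, end):
    """Build a space-separated residue list such as "C1 C2 C3" (end inclusive)."""
    return " ".join(f"{chain}{i}" for i in range(start, end + 1))

# The first 10 residues of chain C, as used in examples 6 and 7.
print(residue_range("C", 1, 10))
```

The printed string can be pasted between the quotes of either flag.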

**8 --number_of_batches**

Design 15 sequences; with batch size 3 (can be 1 when using CPUs) and the number of batches 5.

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/batch_size" \
        --batch_size 3 \
        --number_of_batches 5

**9 --bias_AA**

Global amino acid bias. In this example, output sequences are biased towards W, P, C and away from A.

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --bias_AA "W:3.0,P:3.0,C:3.0,A:-3.0" \
        --out_folder "./outputs/global_bias"
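
The effect of a global bias can be illustrated with a toy calculation. This sketch assumes the bias is added to the logits before the softmax, as in the formula quoted in the comments of example 13; the logits are made-up numbers and `biased_probs` is a hypothetical helper, not part of run.py.

```python
import math

def biased_probs(logits, bias, temperature=1.0):
    """softmax((logits + bias) / temperature) over a dict of amino acids."""
    scaled = {aa: (logits[aa] + bias.get(aa, 0.0)) / temperature for aa in logits}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {aa: math.exp(v - peak) for aa, v in scaled.items()}
    total = sum(exps.values())
    return {aa: e / total for aa, e in exps.items()}

# Toy example: four amino acids with equal logits (made-up numbers).
logits = {"W": 0.0, "P": 0.0, "A": 0.0, "G": 0.0}
bias = {"W": 3.0, "P": 3.0, "A": -3.0}
probs = biased_probs(logits, bias)
print({aa: round(p, 3) for aa, p in probs.items()})
```

Positive biases raise the probability of W and P, the negative bias pushes A below the unbiased G, and everything still sums to one.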

**10 --bias_AA_per_residue**

Specify per residue amino acid bias, e.g. make residues C1, C3, C5, and C7 to be prolines.

In [None]:
# {
# "C1": {"G": -0.3, "C": -2.0, "P": 10.8},
# "C3": {"P": 10.0},
# "C5": {"G": -1.3, "P": 10.0},
# "C7": {"G": -1.3, "P": 10.0}
# }
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --bias_AA_per_residue "./inputs/bias_AA_per_residue.json" \
        --out_folder "./outputs/per_residue_bias"
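
If you want your own per-residue bias file, one way is to write it from Python with the standard `json` module. This is a sketch: the values are the ones shown in the cell's comment, while the file name `my_bias_AA_per_residue.json` and the temporary folder are assumptions for illustration.

```python
import json
import os
import tempfile

bias_per_residue = {
    "C1": {"G": -0.3, "C": -2.0, "P": 10.8},
    "C3": {"P": 10.0},
    "C5": {"G": -1.3, "P": 10.0},
    "C7": {"G": -1.3, "P": 10.0},
}

# Written to a temporary folder here; in the notebook you would save it
# under ./inputs/ and point --bias_AA_per_residue at that path.
path = os.path.join(tempfile.mkdtemp(), "my_bias_AA_per_residue.json")
with open(path, "w") as handle:
    json.dump(bias_per_residue, handle, indent=2)

# Read it back to confirm the file holds what we expect.
with open(path) as handle:
    loaded = json.load(handle)
print(sorted(loaded))
```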

**11 --omit_AA**

Global amino acid restrictions. This is equivalent to using --bias_AA and setting bias to be a large negative number. The output should be just made of E, K, A.



In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --omit_AA "CDFGHILMNPQRSTVWY" \
        --out_folder "./outputs/global_omit"
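
The --omit_AA string is simply every amino acid you do not want. A short sketch that derives it from the set you do want, where `omit_all_except` is a hypothetical helper:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def omit_all_except(allowed):
    """Return the one-letter codes of all standard amino acids not in `allowed`."""
    return "".join(aa for aa in AMINO_ACIDS if aa not in allowed)

# Allow only E, K and A, as in the example above.
print(omit_all_except("EKA"))
```

This reproduces the string used in the cell above.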

**12 --omit_AA_per_residue**

Per-residue amino acid restrictions. In the JSON shown in the cell below, every amino acid except Y is omitted at C1, C3, C5, and C7.

In [None]:
# {
# "C1": "ACDEFGHIKLMNPQRSTVW",
# "C3": "ACDEFGHIKLMNPQRSTVW",
# "C5": "ACDEFGHIKLMNPQRSTVW",
# "C7": "ACDEFGHIKLMNPQRSTVW"
# }
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --omit_AA_per_residue "./inputs/omit_AA_per_residue.json" \
        --out_folder "./outputs/per_residue_omit"

**13 --symmetry_weights**

Designing sequences with symmetry, e.g. homooligomer/2-state proteins, etc. In this example make C1=C2=C3, also C4=C5, and C6=C7.

In [None]:
#total_logits += symmetry_weights[t]*logits
#probs = torch.nn.functional.softmax((total_logits+bias_t) / temperature, dim=-1)
#total_logits_123 = 0.33*logits_1+0.33*logits_2+0.33*logits_3
#output should be ***ooxx
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/symmetry" \
        --symmetry_residues "C1,C2,C3|C4,C5|C6,C7" \
        --symmetry_weights "0.33,0.33,0.33|0.5,0.5|0.5,0.5"
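
The comments in the cell above describe how tied positions share one distribution: their logits are combined as a weighted sum before the softmax. A toy sketch of that idea follows, with made-up logits over three amino acids and the bias term left out; `tied_probs` is a hypothetical helper, not the code in run.py.

```python
import math

def tied_probs(position_logits, weights, temperature=1.0):
    """Combine logits from tied positions and return shared probabilities.

    position_logits: one list of logits per tied position.
    weights: one weight per tied position (e.g. 0.33, 0.33, 0.33).
    """
    size = len(position_logits[0])
    total = [0.0] * size
    for weight, logits in zip(weights, position_logits):
        for i, value in enumerate(logits):
            total[i] += weight * value  # total_logits += symmetry_weights[t] * logits
    scaled = [value / temperature for value in total]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

# Toy logits for tied residues C1, C2, C3 (made-up numbers).
logits_c1 = [3.0, 0.0, 0.0]
logits_c2 = [0.0, 3.0, 0.0]
logits_c3 = [3.0, 0.0, 0.0]
probs = tied_probs([logits_c1, logits_c2, logits_c3], [0.33, 0.33, 0.33])
print([round(p, 3) for p in probs])
```

Because all tied positions draw from the same combined distribution, they receive the same amino acid, which is what the `***ooxx` pattern in the comment expresses.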

**14 --homo_oligomer**

Design homooligomer sequences. This automatically sets --symmetry_residues and --symmetry_weights assuming equal weighting from all chains.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/4GYT.pdb" \
        --out_folder "./outputs/homooligomer" \
        --homo_oligomer 1 \
        --number_of_batches 2

**15 --file_ending**

Outputs will have a specified ending; e.g. 1BC8_xyz.fa instead of 1BC8.fa

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/file_ending" \
        --file_ending "_xyz"

**16 --zero_indexed**

Zero indexed names in /backbones/1BC8_0.pdb, 1BC8_1.pdb, 1BC8_2.pdb etc

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/zero_indexed" \
        --zero_indexed 1 \
        --number_of_batches 2

**17 --chains_to_design**

Specify which chains (e.g. "ABC") need to be redesigned, other chains will be kept fixed. Outputs in seqs/backbones will still have atoms/sequences for the whole input PDB.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/4GYT.pdb" \
        --out_folder "./outputs/chains_to_design" \
        --chains_to_design "B"

**18 --parse_these_chains_only**

Parse and design only specified chains (e.g. "ABC"). Outputs will have only specified chains.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/4GYT.pdb" \
        --out_folder "./outputs/parse_these_chains_only" \
        --parse_these_chains_only "B"

**19 --model_type "ligand_mpnn"**

Run LigandMPNN with default settings.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/V5cp3.pdb" \
        --redesigned_residues "A94 A95 A96 A97 A98 A99 A100 A101 A102 H97 H98 H99 H100 H101 H102" \
        --out_folder "./outputs/ligandmpnn_default"

**20 --checkpoint_ligand_mpnn**

Run LigandMPNN using 0.05A model by specifying --checkpoint_ligand_mpnn flag.

In [None]:
!python run.py \
        --checkpoint_ligand_mpnn "./model_params/ligandmpnn_v_32_005_25.pt" \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/ligandmpnn_v_32_005_25"

**21 --ligand_mpnn_use_atom_context**

Setting --ligand_mpnn_use_atom_context 0 will mask all ligand atoms. This can be used to assess how much ligand atoms affect AA probabilities.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/ligandmpnn_no_context" \
        --ligand_mpnn_use_atom_context 0

**22 --ligand_mpnn_use_side_chain_context**

Use fixed residue side chain atoms as extra ligand atoms.



In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/ligandmpnn_use_side_chain_atoms" \
        --ligand_mpnn_use_side_chain_context 1 \
        --fixed_residues "C1 C2 C3 C4 C5 C6 C7 C8 C9 C10"

**23 --model_type "soluble_mpnn"**

Run SolubleMPNN (ProteinMPNN-like model with only soluble proteins in the training dataset).

In [None]:
!python run.py \
        --model_type "soluble_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/soluble_mpnn_default"

**24 --model_type "global_label_membrane_mpnn"**

Run global label membrane MPNN (trained with an extra input: a binary label, soluble vs. not). Set --global_transmembrane_label to 1 for membrane or 0 for soluble.

In [None]:
!python run.py \
        --model_type "global_label_membrane_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/global_label_membrane_mpnn_0" \
        --global_transmembrane_label 0

**25 --model_type "per_residue_label_membrane_mpnn"**

Run per residue label membrane MPNN (trained with extra input per residue specifying buried (hydrophobic), interface (polar), or other type residues; 3 classes).

In [None]:
!python run.py \
        --model_type "per_residue_label_membrane_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/per_residue_label_membrane_mpnn_default" \
        --transmembrane_buried "C1 C2 C3 C11" \
        --transmembrane_interface "C4 C5 C6 C22"

**26 --fasta_seq_separation**

Choose a symbol to put between different chains in the fasta output format. It is recommended to use the PDB output format to deal with residue jumps and multiple chain parsing.

In [None]:
!python run.py \
        --pdb_path "./inputs/1BC8.pdb" \
        --out_folder "./outputs/fasta_seq_separation" \
        --fasta_seq_separation ":"

**27 --pdb_path_multi**

Specify multiple PDB input paths. This is more efficient since the model needs to be loaded from the checkpoint once.

In [None]:
#{
#"./inputs/1BC8.pdb": "",
#"./inputs/4GYT.pdb": ""
#}
!python run.py \
        --pdb_path_multi "./inputs/pdb_ids.json" \
        --out_folder "./outputs/pdb_path_multi" \
        --seed 111

**28 --fixed_residues_multi**

Specify fixed residues when using --pdb_path_multi flag.

In [None]:
#{
#"./inputs/1BC8.pdb": "C1 C2 C3 C4 C5 C10 C22",
#"./inputs/4GYT.pdb": "A7 A8 A9 A10 A11 A12 A13 B38"
#}
!python run.py \
        --pdb_path_multi "./inputs/pdb_ids.json" \
        --fixed_residues_multi "./inputs/fix_residues_multi.json" \
        --out_folder "./outputs/fixed_residues_multi" \
        --seed 111

**29 --redesigned_residues_multi**

Specify which residues need to be redesigned when using --pdb_path_multi flag.

In [None]:
#{
#"./inputs/1BC8.pdb": "C1 C2 C3 C4 C5 C10",
#"./inputs/4GYT.pdb": "A7 A8 A9 A10 A12 A13 B38"
#}
!python run.py \
        --pdb_path_multi "./inputs/pdb_ids.json" \
        --redesigned_residues_multi "./inputs/redesigned_residues_multi.json" \
        --out_folder "./outputs/redesigned_residues_multi" \
        --seed 111

**30 --omit_AA_per_residue_multi**

Specify which amino acids to omit at each residue when using the --pdb_path_multi flag.

In [None]:
#{
#"./inputs/1BC8.pdb": {"C1":"ACDEFGHILMNPQRSTVWY", "C2":"ACDEFGHILMNPQRSTVWY", "C3":"ACDEFGHILMNPQRSTVWY"},
#"./inputs/4GYT.pdb": {"A7":"ACDEFGHILMNPQRSTVWY", "A8":"ACDEFGHILMNPQRSTVWY"}
#}
!python run.py \
        --pdb_path_multi "./inputs/pdb_ids.json" \
        --omit_AA_per_residue_multi "./inputs/omit_AA_per_residue_multi.json" \
        --out_folder "./outputs/omit_AA_per_residue_multi" \
        --seed 111

**31 --bias_AA_per_residue_multi**

Specify amino acid biases per residue when using --pdb_path_multi flag.



In [None]:
#{
#"./inputs/1BC8.pdb": {"C1":{"A":3.0, "P":-2.0}, "C2":{"W":10.0, "G":-0.43}},
#"./inputs/4GYT.pdb": {"A7":{"Y":5.0, "S":-2.0}, "A8":{"M":3.9, "G":-0.43}}
#}
!python run.py \
        --pdb_path_multi "./inputs/pdb_ids.json" \
        --bias_AA_per_residue_multi "./inputs/bias_AA_per_residue_multi.json" \
        --out_folder "./outputs/bias_AA_per_residue_multi" \
        --seed 111

**32 --ligand_mpnn_cutoff_for_score**

This sets the cutoff distance in angstroms to select residues that are considered to be close to ligand atoms. This flag only affects the num_ligand_res and ligand_confidence in the output fasta files.

In [None]:
!python run.py \
        --model_type "ligand_mpnn" \
        --seed 111 \
        --pdb_path "./inputs/1BC8.pdb" \
        --ligand_mpnn_cutoff_for_score "6.0" \
        --out_folder "./outputs/ligand_mpnn_cutoff_for_score"

**33 specifying residues with insertion codes**

You can specify a residue using chain_id + residue_number + insertion_code; e.g. redesign only residues B82, B82A, B82B, B82C.

In [None]:
!python run.py \
        --seed 111 \
        --pdb_path "./inputs/2GFB.pdb" \
        --out_folder "./outputs/insertion_code" \
        --redesigned_residues "B82 B82A B82B B82C" \
        --parse_these_chains_only "B"
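
To make the chain_id + residue_number + insertion_code format concrete, here is a sketch that splits such an identifier into its parts. It assumes single-letter chain IDs and an optional single-letter insertion code; `parse_residue_id` is a hypothetical helper and not the parser that run.py uses.

```python
import re

# One chain letter, an integer residue number, and an optional insertion code.
RESIDUE_ID = re.compile(r"^([A-Za-z])(-?\d+)([A-Za-z]?)$")

def parse_residue_id(text):
    """Split e.g. "B82A" into ("B", 82, "A"); the code is "" when absent."""
    match = RESIDUE_ID.match(text)
    if match is None:
        raise ValueError(f"not a residue identifier: {text!r}")
    chain, number, icode = match.groups()
    return chain, int(number), icode

for residue in "B82 B82A B82B B82C".split():
    print(residue, parse_residue_id(residue))
```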

**Last - How to use LigandMPNN script more effectively**

In [None]:
!python run.py -h

usage: run.py [-h] [--model_type MODEL_TYPE] [--checkpoint_protein_mpnn CHECKPOINT_PROTEIN_MPNN]
              [--checkpoint_ligand_mpnn CHECKPOINT_LIGAND_MPNN]
              [--checkpoint_per_residue_label_membrane_mpnn CHECKPOINT_PER_RESIDUE_LABEL_MEMBRANE_MPNN]
              [--checkpoint_global_label_membrane_mpnn CHECKPOINT_GLOBAL_LABEL_MEMBRANE_MPNN]
              [--checkpoint_soluble_mpnn CHECKPOINT_SOLUBLE_MPNN]
              [--fasta_seq_separation FASTA_SEQ_SEPARATION] [--verbose VERBOSE]
              [--pdb_path PDB_PATH] [--pdb_path_multi PDB_PATH_MULTI]
              [--fixed_residues FIXED_RESIDUES] [--fixed_residues_multi FIXED_RESIDUES_MULTI]
              [--redesigned_residues REDESIGNED_RESIDUES]
              [--redesigned_residues_multi REDESIGNED_RESIDUES_MULTI] [--bias_AA BIAS_AA]
              [--bias_AA_per_residue BIAS_AA_PER_RESIDUE]
              [--bias_AA_per_residue_multi BIAS_AA_PER_RESIDUE_MULTI] [--omit_AA OMIT_AA]
              [--omit_AA_per_residu

If you use [LigandMPNN](https://github.com/dauparas/LigandMPNN?tab=readme-ov-file#31---bias_aa_per_residue_multi), please cite:
[LigandMPNN (preprint)](https://www.biorxiv.org/content/10.1101/2023.12.22.573103v1)

@article{dauparas2023atomic,
  title={Atomic context-conditioned protein sequence design using LigandMPNN},
  author={Dauparas, Justas and Lee, Gyu Rie and Pecoraro, Robert and An, Linna and Anishchenko, Ivan and Glasscock, Cameron and Baker, David},
  journal={bioRxiv},
  pages={2023--12},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},  
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}
