<a href="https://colab.research.google.com/github/sokrypton/ColabDesign/blob/main/af/design.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AfDesign (beta version)
Backprop through AlphaFold for protein design.

**WARNING**
1.   This notebook is in active development and was designed for demonstration purposes only.
2.   Using AfDesign as the only "loss" function for design might be a bad idea: you may find adversarial sequences (i.e., sequences that trick AlphaFold).
3.   The current setup is limited to a maximum length of ~256 amino acids.

**CHANGE LOG**
*   07Feb2022 - refactored the optimizer code. Changed the default to NSGD (normalized SGD, to match TrDesign).
*   08Feb2022 - fixed a bug so that msa design mode works with the `binder` design protocol.
*   19Feb2022 - made `dropout` a dynamic option (can be turned on/off and rescaled during optimization).
*   20Feb2022 - `num_models`, `model_mode` and `model_parallel` options refactored.

In [None]:
#@title install
%%bash
if [ ! -d af_backprop ]; then
  git clone https://github.com/sokrypton/af_backprop.git
  pip -q install biopython dm-haiku==0.0.5 ml-collections py3Dmol
fi
if [ ! -d params ]; then
  mkdir params
  curl -fsSL https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar | tar x -C params
fi
wget -qnc https://raw.githubusercontent.com/sokrypton/ColabFold/main/beta/colabfold.py
wget -qnc https://raw.githubusercontent.com/sokrypton/ColabDesign/main/af/design.py

In [None]:
#@title import libraries
import sys
sys.path.append('/content/af_backprop')

import os
from google.colab import files
import numpy as np
from IPython.display import HTML
from design import mk_design_model, clear_mem

#########################
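def get_pdb(pdb_code=""):
  """Return the path to a local PDB file: upload one if no code is given, else download it from RCSB."""
  if pdb_code is None or pdb_code == "":
    # no PDB code given: prompt for a file upload and save the first uploaded file
    upload_dict = files.upload()
    pdb_string = upload_dict[list(upload_dict.keys())[0]]
    with open("tmp.pdb","wb") as out: out.write(pdb_string)
    return "tmp.pdb"
  else:
    # fetch the entry from RCSB (-nc skips the download if the file already exists)
    os.system(f"wget -qnc https://files.rcsb.org/view/{pdb_code}.pdb")
    return f"{pdb_code}.pdb"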

In [None]:
#@title ##define global options


##############################################################
# GET OPTIONS
##############################################################
#@markdown ###model options

num_models = 1 #@param ["1", "2", "3", "4", "5"] {type:"raw"}
#@markdown - `num_models` - number of model params to use at each iteration.
model_mode = "sample" #@param ["sample", "fixed"]
#@markdown - `sample` - randomly select model params to use. (Recommended)
#@markdown - `fixed` - use the same model params each iteration.
model_parallel = False #@param {type:"boolean"}
#@markdown - `model_parallel` - run model params in parallel if `num_models` > 1. (May speed up the run if you have access to a high-end GPU.)


#@markdown ###recycle options
num_recycles = 0 #@param ["0", "1", "2", "3"] {type:"raw"}
#@markdown - `num_recycles` - max number of recycles to use during design (for de novo proteins we find 0 is often enough)
recycle_mode = "sample" #@param ["sample", "add_prev", "last", "backprop"]
#@markdown - `sample` - at each iteration, randomly select number of recycles to use. (Recommended)
#@markdown - `add_prev` - add prediction logits (dgram, pae, plddt) across all recycles. (Most stable, but slow and requires more memory).
#@markdown - `last` - only use gradients from last recycle.
#@markdown - `backprop` - use outputs from last recycle, but backprop through all recycles.


OPT = {"num_models":num_models, "model_mode":model_mode, "model_parallel":model_parallel,
       "num_recycles":num_recycles, "recycle_mode":recycle_mode}

# fixed backbone design (fixbb)
For a given protein backbone, generate/design a new sequence that AlphaFold thinks folds into that conformation. 

---

**weights of the model**
- `dgram_cce` - minimizes the categorical cross-entropy between the predicted distogram (binned distance matrix) and the one extracted from the pdb
- `fape`      - minimizes the difference between coordinates (see AlphaFold paper)
- `pae`       - minimizes the predicted alignment error
- `plddt`     - maximizes the predicted LDDT
- `msa_ent`   - minimizes entropy for MSA design (see example at the end of notebook)
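
As a rough illustration of the `dgram_cce` term above, the sketch below bins pairwise distances into a one-hot distogram and scores predicted logits against it with categorical cross-entropy. It is a NumPy sketch, not the code in `design.py`: the helper names `distogram_one_hot` and `dgram_cce` are hypothetical, the bin edges (64 bins with breaks from 2.3125 to 21.6875 Å) are assumed from AlphaFold's distogram head, and a single coordinate per residue stands in for whichever atoms the real model uses.

```python
import numpy as np

def distogram_one_hot(coords, first_break=2.3125, last_break=21.6875, num_bins=64):
    """Bin pairwise distances of (L, 3) coords into a one-hot (L, L, num_bins) array."""
    # num_bins - 1 break points split the distance axis into num_bins bins (assumed values)
    breaks = np.linspace(first_break, last_break, num_bins - 1)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # bin index = number of break points the distance exceeds
    idx = (dist[..., None] > breaks).sum(-1)
    return np.eye(num_bins)[idx]

def dgram_cce(logits, target):
    """Mean categorical cross-entropy between predicted logits and a one-hot target."""
    z = logits - logits.max(-1, keepdims=True)             # stabilize the softmax
    log_p = z - np.log(np.exp(z).sum(-1, keepdims=True))   # log-softmax over bins
    return float(-(target * log_p).sum(-1).mean())

# toy example: random coordinates standing in for a backbone
rng = np.random.default_rng(0)
coords = rng.normal(scale=10.0, size=(8, 3))
target = distogram_one_hot(coords)

uniform_loss = dgram_cce(np.zeros_like(target), target)  # uninformative prediction
sharp_loss = dgram_cce(10.0 * target, target)            # prediction peaked on the target bins
print(uniform_loss, sharp_loss)
```

A prediction that puts most of its probability on the correct distance bins gives a lower loss than a uniform one, which is the direction the optimizer pushes the sequence in.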

**notes**
- the reported `pae` and `plddt` loss values are between 0 and 1 (lower is better for both)
- we find `dgram_cce` loss to be more stable for design (compared to `fape`)
- For **optimization** we provide 4 different functions:
 - `design_logits()` - optimize `logits` inputs (continuous)
 - `design_prob()` - optimize `softmax(logits)` inputs (probabilities)

 For complex topologies, we find directly optimizing the one-hot encoded sequence, `design_prob(hard=True)`, to be very challenging. To get around this problem, we propose optimizing in 2 or 3 stages:
 - `design_2stage()` - `prob` → `hard`
 - `design_3stage()` - `logits` → `prob` → `hard`
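
The three input representations named above can be sketched with NumPy as follows. This only illustrates what `logits`, `prob` and `hard` inputs look like for a toy sequence; the helper names are hypothetical, and how `design.py` moves between stages or passes gradients through the one-hot step is not shown here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis, keepdims=True))
    return e / e.sum(axis, keepdims=True)

def seq_representations(logits):
    """Return the continuous, probabilistic and one-hot views of (L, 20) logits."""
    prob = softmax(logits)                               # each row sums to 1
    hard = np.eye(logits.shape[-1])[prob.argmax(-1)]     # one-hot of the most likely residue
    return {"logits": logits, "prob": prob, "hard": hard}

rng = np.random.default_rng(0)
reps = seq_representations(rng.normal(size=(5, 20)))     # toy sequence of length 5
for name, value in reps.items():
    print(name, value.shape, value[0].round(2))
```

Moving from `logits` to `prob` to `hard` narrows the input from an unconstrained real-valued matrix to a single discrete sequence, which is why the staged protocols end on `hard`.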


In [None]:
#@markdown inputs
protocol = "fixbb"
pdb_code = "1TEN" #@param {type:"string"}
chain = "A" #@param ["A", "B", "C"] {allow-input: true}

clear_mem()
model = mk_design_model(**OPT, protocol=protocol)
model.prep_inputs(pdb_filename=get_pdb(pdb_code), chain=chain)

print("length",  model._len)
print("weights", model.opt["weights"])

In [None]:
# model.restart() is not needed the first time you run, but can be used to
# restart trajectory without needing to recompile the model
model.restart()
model.design_3stage()

In [None]:
model.plot_traj()  

In [None]:
HTML(model.animate())

In [None]:
model.plot_pdb()

In [None]:
model.save_pdb(f"{pdb_code}.{model.protocol}.pdb")

# hallucination
For a given length, generate/hallucinate a protein sequence that AlphaFold thinks folds into a well structured protein (high plddt, low pae, many contacts).

---
**weights of the model**
- `con` - maximize number of contacts. (We find that optimizing `plddt` alone results in a single long helix, and optimizing `pae` alone results in a two-helix bundle. To encourage compact structures we add a `con` term.)
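
To make "number of contacts" concrete, here is a small NumPy sketch that counts residue pairs closer than a distance cutoff, ignoring near neighbours in sequence. The function name, the 8 Å cutoff and the minimum sequence separation are illustrative assumptions, and counting contacts from coordinates like this is not necessarily how the `con` term is computed in `design.py`.

```python
import numpy as np

def count_contacts(coords, cutoff=8.0, min_sep=3):
    """Count residue pairs (i < j, j - i >= min_sep) closer than `cutoff` (in Å)."""
    n = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(n, k=min_sep)   # pairs at least min_sep apart in sequence
    return int((dist[i, j] < cutoff).sum())

n_res = 30
# fully extended chain: consecutive residues 3.8 Å apart along a line
extended = np.stack([3.8 * np.arange(n_res), np.zeros(n_res), np.zeros(n_res)], axis=-1)
# unphysically compact toy: all residues placed inside a 4 Å cube
compact = np.random.default_rng(0).uniform(0.0, 4.0, size=(n_res, 3))

print(count_contacts(extended), count_contacts(compact))
```

The extended chain has no contacts beyond its sequence neighbours while the compact toy has many, so a term that rewards contacts favours compact structures.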


In [None]:
#@markdown inputs
protocol = "hallucination"
length = 100 #@param {type:"raw"}
copies = 1 #@param {type:"raw"}

clear_mem()
model = mk_design_model(**OPT, protocol=protocol)
model.prep_inputs(length=length, copies=copies)

print("length",model._len)
print("weights",model.opt["weights"])

In [None]:
###########################
# For hallucination, default initialization often converges to all-helical proteins.
# For this task, we recommend gumbel initialization w/ design_2stage()
###########################
model.restart(seq_init="gumbel")
model.design_2stage()

In [None]:
HTML(model.animate())

In [None]:
model.get_seqs()

In [None]:
model.plot_pdb()

In [None]:
model.save_pdb(f"{model.protocol}.pdb")

# binder hallucination
For a given protein target and protein binder length, generate/hallucinate a protein binder sequence that AlphaFold thinks will bind to the target structure. To do this, we minimize PAE and maximize the number of contacts at the interface and within the binder, and we maximize the pLDDT of the binder.

---
**weights of the model**
- WARNING: the default weights were chosen arbitrarily; it might help to change these!
- `pae_inter` - minimize PAE at the interface of the proteins
- `pae_intra` - minimize PAE within binder
- `con_inter` - maximize number of contacts at the interface of the proteins
- `con_intra` - maximize number of contacts within binder
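
The `_inter`/`_intra` split above can be pictured as taking different blocks of a pairwise matrix. The sketch below assumes the target residues come first and the binder residues last in a (target + binder) square matrix; that ordering, the helper name `split_pairwise` and the plain averaging are assumptions for illustration, not the `design.py` implementation.

```python
import numpy as np

def split_pairwise(mat, target_len):
    """Average a (T+B, T+B) pairwise matrix over binder-binder and target-binder blocks."""
    intra = mat[target_len:, target_len:]     # binder vs binder
    inter_a = mat[:target_len, target_len:]   # target rows, binder columns
    inter_b = mat[target_len:, :target_len]   # binder rows, target columns
    inter = np.concatenate([inter_a.ravel(), inter_b.ravel()])
    return {"intra": float(intra.mean()), "inter": float(inter.mean())}

# toy PAE-like matrix: confident within each chain, uncertain across the interface
target_len, binder_len = 6, 4
n = target_len + binder_len
pae = np.full((n, n), 0.2)
pae[:target_len, target_len:] = 0.8
pae[target_len:, :target_len] = 0.8

scores = split_pairwise(pae, target_len)
print(scores)
```

Both off-diagonal blocks are included in the interface average because a PAE matrix is not symmetric in general.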

In [None]:
#@markdown inputs
protocol = "binder"
pdb_code = "4MZK" #@param {type:"string"}
chain = "A" #@param ["A", "B", "C"] {allow-input: true}
binder_length = 19 #@param {type:"integer"}

clear_mem()
model = mk_design_model(**OPT, protocol=protocol)
model.prep_inputs(pdb_filename=get_pdb(pdb_code), chain=chain,
                         binder_len=binder_length)

print("target_length",model._target_len)
print("binder_length",model._binder_len)
print("weights",model.opt["weights"])

In [None]:
model.restart()
model.design_2stage()

In [None]:
HTML(model.animate())

In [None]:
model.get_seqs()

In [None]:
model.plot_pdb()

In [None]:
model.save_pdb(f"{model.protocol}.pdb")