[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/delalamo/af2_conformations/blob/main/notebooks/choose_templates.ipynb)

# Conformationally selective AlphaFold predictions

This notebook provides an interface for predicting the structures of proteins using AlphaFold [1]. It simplifies the use of custom templates for the prediction of specific conformations. **Its intended audience is users familiar with Python.** The code borrows heavily from ColabFold [2] and uses the same MMseqs2 API to retrieve sequence alignments and templates [3,4]. Users of this notebook should cite these publications (listed below).

The fundamental differences between this notebook and those provided by DeepMind and ColabFold are that 1) it simplifies the tuning of specific parameters by exposing them directly to the user, and 2) it allows users to specify which templates should be retrieved from the PDB and used for modeling. The former is useful when various parameters need to be chosen (e.g. MSA depth), while the latter allows targeting of specific conformational subspaces.

Some notes and caveats:
* Template subsampling is turned on by default (it is turned off in AlphaFold and ColabFold). This should have no impact on predictions that use four or fewer templates in total.
* Currently only the structures of monomers can be predicted.
* Relax is disabled. If you plan on evaluating these structures using an energy function, be sure to minimize them using OpenMM [5] or Rosetta [6] beforehand.
* Not all PDB entries are in the MMseqs2 template database, so a PDB of interest may not be retrieved.
* Templates are aligned based on sequence similarity, not structural similarity. This may pose a problem when using distantly related proteins as templates.
* We removed many of the bells and whistles of other Colab notebooks, including pLDDT-based model ranking, visualization of sequence alignment coverage, and progress bars.
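
Because some requested PDBs may not be retrieved, it can be useful to compare the requested list against what was actually downloaded before running predictions. The sketch below is one way to do that. It assumes the template directory holds one mmCIF file per retrieved entry, named by PDB ID (for example `6lb8.cif`); that layout is an assumption about the downloaded data and not something this notebook guarantees. `missing_templates` is a hypothetical helper and is not part of `af2_conformations`.

```python
from pathlib import Path


def missing_templates(template_dir, requested):
    """Return the requested entries whose PDB ID has no .cif file in template_dir.

    Assumption: each retrieved template is stored as <pdb_id>.cif.
    Entries are written like "6LB8_A" (PDB ID, underscore, chain ID).
    """
    # Collect the IDs found on disk, lowercased for a case-insensitive match
    found = {path.stem.lower() for path in Path(template_dir).glob("*.cif")}
    # Only the PDB ID (the part before the underscore) is checked here;
    # the chain ID is not verified by this sketch
    return [
        entry for entry in requested
        if entry.split("_")[0].lower() not in found
    ]
```

In the prediction cell this could be called as `missing_templates(template_path, pdbs)` once `run_job` has returned. An empty list means a file was found for every requested PDB ID.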

Models can be downloaded either at the end of the run or incrementally while the program is still running. For the latter, click the folder icon on the left sidebar, hover over the file of interest, click the three vertical dots, and select "download".

In [None]:
#@title Set up Colab environment (1 of 2)
%%bash

# get templates
git clone https://github.com/delalamo/af2_conformations.git

# get AF2
git clone https://github.com/deepmind/alphafold.git
pip3 install -r ./alphafold/requirements.txt

mv alphafold alphafold_
mv alphafold_/alphafold .
rm -r alphafold_
# remove "END" from PDBs, otherwise biopython complains
sed -i "s/pdb_lines.append('END')//" /content/alphafold/common/protein.py
sed -i "s/pdb_lines.append('ENDMDL')//" /content/alphafold/common/protein.py

# download model params (~1 min)
mkdir params
curl -fsSL https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar | tar x -C params

# download libraries for interfacing with MMseqs2 API
apt-get -y update
apt-get -y install jq curl zlib1g gawk

# setup conda
wget -qnc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -bfp /usr/local  2>&1 1>/dev/null
rm Miniconda3-latest-Linux-x86_64.sh

# setup template search
conda install -q -y  -c conda-forge -c bioconda kalign3=3.2.2 hhsuite=3.3.0 python=3.7

In [None]:
#@title Set up Colab environment (2 of 2)

from google.colab import files

from af2_conformations.scripts import predict
from af2_conformations.scripts import util
from af2_conformations.scripts import mmseqs2

import random
import os

from absl import logging
logging.set_verbosity(logging.DEBUG)

Once everything has been installed, the code below can be modified and executed.
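
Before submitting a job, it may help to check the inputs locally. The sketch below strips whitespace from a sequence, rejects non-standard residue letters, and flags template entries that do not follow the uppercase `PDBID_CHAIN` form used in this notebook. `clean_sequence` and `malformed_templates` are hypothetical helpers and are not part of `af2_conformations`. Converting the sequence to uppercase and restricting it to the 20 standard amino acids are assumptions made for this example.

```python
import re

# The 20 standard amino acids (assumption: no ambiguity codes such as X or B)
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

# A PDB ID is four characters starting with a digit, followed by "_" and a chain ID
TEMPLATE_PATTERN = re.compile(r"[0-9][A-Z0-9]{3}_[A-Za-z0-9]+")


def clean_sequence(raw):
    """Remove all whitespace, uppercase the sequence, and validate its letters."""
    seq = "".join(raw.split()).upper()
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"Unexpected residue letters: {unexpected}")
    return seq


def malformed_templates(entries):
    """Return the entries that do not look like an uppercase 'PDBID_CHAIN' string."""
    return [entry for entry in entries if not TEMPLATE_PATTERN.fullmatch(entry)]
```

For example, `malformed_templates(["6LB8_A", "4PK0"])` returns `["4PK0"]` because the chain ID is missing.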

In [None]:
jobname = 'T4_lysozyme'
sequence = ("MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVIT"
            "KDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSL"
            "RMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL" )

# PDB IDs, written uppercase with chain ID specified
pdbs = ["6LB8_A",
        "6LB8_C",
        "4PK0_A",
        "6FW2_A"]

# The MMSeqs2Runner object submits the amino acid sequence to
# the MMSeqs2 server, generates a directory, and populates it with
# data retrieved from the server. Templates may be specified by the user.
# All templates are fetched if none are provided or the list is empty.
mmseqs2_runner = mmseqs2.MMSeqs2Runner( jobname, sequence )

# Fetch sequences and download data
a3m_lines, template_path = mmseqs2_runner.run_job( templates = pdbs )

# A nested loop in which 5 models are generated per MSA depth value.
# This example sweeps nseq from 16 to 33 (the upper bound of range is exclusive).
# In our manuscript we use three MSA depths: 32, 128, and 5120 sequences.
for nseq in range( 16, 34 ):
  for n_model in range( 5 ):

    # Randomly choose one of the two AlphaFold neural
    # networks capable of using templates.
    # In our experience, model 1 is more sensitive to input templates.
    # However, this observation is purely anecdotal and not backed up by
    # hard numbers.
    model_id = random.choice( ( 1, 2 ) )

    # Specify the name of the output PDB
    outname = f"{ n_model }_{ nseq }.pdb"

    # Run the job and save as a PDB
    predict.predict_structure_from_templates(
        mmseqs2_runner.seq, # NOTE mmseqs2_runner removes whitespace from seq
        outname,
        a3m_lines,
        template_path = template_path,
        model_id = model_id,
        max_msa_clusters = nseq // 2,
        max_extra_msa = nseq,
        max_recycles = 1
    )

    # Alternatively, users can run a template-free prediction by removing
    # the triple quotes around the call below:

    '''
    predict.predict_structure_no_templates( sequence, outname,
         a3m_lines, model_id = model_id, max_msa_clusters = nseq // 2,
         max_extra_msa = nseq, max_recycles = 1 )
    '''

# To download predictions:
!zip -FSr "af2.zip" *".pdb"
files.download( "af2.zip" )
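
The cell above derives both MSA parameters from a single depth value (`max_msa_clusters = nseq // 2`, `max_extra_msa = nseq`) and names each output `{n_model}_{nseq}.pdb`. The sketch below lists the runs that a given set of depths would produce, using the three depths mentioned in the comment above as an example. `msa_settings` and `plan_runs` are hypothetical helpers written for illustration. The sketch assumes that each depth value corresponds to `nseq` in the loop, and it only plans the runs without calling AlphaFold.

```python
def msa_settings(nseq):
    """Map one MSA depth value to the two parameters used in the cell above."""
    return {"max_msa_clusters": nseq // 2, "max_extra_msa": nseq}


def plan_runs(depths, models_per_depth=5):
    """List (output name, MSA settings) pairs in the order the loop would run them."""
    return [
        (f"{n_model}_{nseq}.pdb", msa_settings(nseq))
        for nseq in depths
        for n_model in range(models_per_depth)
    ]


# Example with the three depths named in the comment above
runs = plan_runs((32, 128, 5120))
for outname, settings in runs:
    print(outname, settings)
```

To run only these depths, the outer loop could iterate over `( 32, 128, 5120 )` in place of `range( 16, 34 )`.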

# References:
1. Jumper et al. "Highly accurate protein structure prediction with AlphaFold" Nature (2021)
2. Mirdita et al. "ColabFold - making protein folding accessible to all" bioRxiv (2021)
3. Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nature Biotechnology (2017)
4. Mirdita et al. "MMseqs2 desktop and local web server app for fast, integrative sequence searches" Bioinformatics (2019)
5. Eastman et al. "OpenMM 7: Rapid development of high performance algorithms for molecular dynamics" PLoS Computational Biology (2017)
6. Koehler-Leman et al. "Macromolecular modeling and design in Rosetta: recent methods and frameworks" Nature Methods (2020)