A step-by-step protocol for predicting protein monomer and multimer structures and conformations using ColabFold/AlphaFold2
ColabFold has two interfaces:
- Web-based Jupyter notebooks running on Google Colaboratory
- Command-line tools (local)
Our protocol guides readers in three scenarios with both interfaces: (1) monomer prediction, (2) complex prediction, and (3) conformation sampling.
We demonstrate the use of ColabFold with the human glycosylphosphatidylinositol transamidase (GPIT) complex for monomer and complex prediction, and with the human Alanine Serine Cysteine Transporter 2 (ASCT2), which adopts alternative conformations, for conformation sampling.
Protein sequence queries (FASTA/CSV) used in this protocol can be found in the /query directory.
Corresponding experimental structures (PDB) can be found in the /ref directory:
- Monomer, Complex: GPIT complex
- Conformation: ASCT2 (inward-open), ASCT2 (outward-open)
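As a rough illustration of how a monomer query differs from a complex query, the sketch below writes FASTA records in which the chains of a complex are joined with a colon, the separator ColabFold uses for multi-chain input. The function name, job names, and short sequences are placeholders invented for this example; the actual queries are the files in /query.

```python
def make_query(job_name, chain_sequences):
    """Return a FASTA record; multiple chains are joined with ':' (complex query)."""
    cleaned = [seq.strip().upper() for seq in chain_sequences]
    if not cleaned or any(not seq.isalpha() for seq in cleaned):
        raise ValueError("each chain must be a non-empty amino-acid string")
    return f">{job_name}\n{':'.join(cleaned)}\n"

# Placeholder sequences, not the real GPIT subunits.
monomer = make_query("toy_monomer", ["MKTAYIAKQR"])
complex_query = make_query("toy_complex", ["MKTAYIAKQR", "GSHMLEDPVA"])
print(monomer, complex_query, sep="")
```

A single-chain record is treated as a monomer, while a record containing colons is treated as one complex made of the listed chains.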
To open a notebook in the Google Colaboratory environment, press the button at the top of the notebook.
Most of the notebooks can be executed for free, but complex prediction (GPIT_complex.ipynb) requires a paid Colab Pro account because of the length of the complex.
Please also note that the notebooks are provided as guides for tuning parameters and are not intended for rerunning, since the Google Colaboratory environment itself may change. To run a prediction from the beginning, use the ColabFold-AlphaFold2 notebook directly.
- Monomer: PIGU, PIGT, PIGK, PIGS, GPAA1
- Multimer: GPIT complex
- Conformation: ASCT2 (Dropout), ASCT2 (MSA Depth Reduction)
The results of the notebooks can be found in the /web/result directory, which includes the following outputs:
- predicted structures (PDB)
- generated MSAs (A3M)
- confidence measures (JSON, log.txt): pLDDT, PAE, pTM, ipTM (for complex prediction)
- images (PNG) for visualizing MSA sequence coverage and confidence measures
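The confidence measures in the JSON output can be summarized with a few lines of Python. The sketch below is illustrative only: the key names (plddt, max_pae, ptm, iptm) are assumptions about the layout of a ColabFold scores file and should be checked against the files in /web/result, and the example uses a small synthetic file rather than a real prediction.

```python
import json
import tempfile
from pathlib import Path

def summarize_scores(path):
    """Return mean pLDDT, max PAE, pTM, and ipTM (None if absent) from a scores JSON."""
    scores = json.loads(Path(path).read_text())
    plddt = scores["plddt"]  # assumed: one pLDDT value per residue
    return {
        "mean_plddt": sum(plddt) / len(plddt),
        "max_pae": scores.get("max_pae"),
        "ptm": scores.get("ptm"),
        "iptm": scores.get("iptm"),  # expected only for complex predictions
    }

# Synthetic example standing in for a real scores file.
example = {"plddt": [90.0, 80.0, 70.0], "max_pae": 12.5, "ptm": 0.8}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_scores.json"
    path.write_text(json.dumps(example))
    summary = summarize_scores(path)
print(summary)
```

Using `.get` for the optional keys lets the same function handle monomer output, where no ipTM is expected.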
For detailed instructions on how to install ColabFold locally, refer to ColabFold or localcolabfold.
To run ColabFold locally, use the following command:
colabfold_batch input_seqs.fasta /path/to/results
To run the procedures in this protocol locally, please follow the steps below.
# Clone this repository
git clone https://github.com/steineggerlab/colabfold-protocol.git
# Move to the batch directory
cd colabfold-protocol/batch
# Run colabfold for each monomer
sh script/run_colabfold_GPIT_monomer.sh
# Predict the structure of GPIT complex with alphafold_multimer_v3
sh script/run_colabfold_GPIT_complex.sh
# Predict the complex structure pairwise
sh script/run_colabfold_GPIT_pair.sh
# Validate the predicted structure by aligning to the experimental structure
sh script/validate_colabfold_prediction.sh
# Render the structure alignment (ChimeraX required)
open script/render_structure_alignment.cxc
The results of the local predictions can be found in the /batch/result directory, and are organized in the same way as the web notebook output.
This procedure generates multiple conformational states from a single input sequence by increasing the uncertainty of the AlphaFold2 network. To this end, we use two different strategies: (1) MSA depth reduction and (2) activation of dropout layers.
For each strategy, set ColabFold parameters as follows:
- MSA depth reduction: max_msa=32:64, num_seeds=16
- Dropout: use-dropout, num_seeds=16
With the above settings, a total of 80 structures (5 models × 16 seeds) will be generated for each strategy, and the results will be found in the web/result/conformation and batch/result/conformation directories for web and local predictions, respectively. Because the full set is large, we only provide 20 representative structures for each strategy.
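For local runs, the two strategies might translate into command lines like the ones assembled below. This is a sketch rather than the protocol's own script: the flag spellings (--num-seeds, --max-msa, --use-dropout) should be confirmed with `colabfold_batch --help` for the installed version, the query and output paths are hypothetical, and the count assumes that ColabFold runs five models per seed. The commands are only printed, not executed.

```python
import shlex

NUM_MODELS = 5   # assumption: five AlphaFold2 models are run per seed
NUM_SEEDS = 16   # from the parameter settings listed for both strategies

def sampling_command(strategy, query, out_dir):
    """Build (not run) a colabfold_batch command line for one sampling strategy."""
    args = ["colabfold_batch", "--num-seeds", str(NUM_SEEDS)]
    if strategy == "msa_depth":
        args += ["--max-msa", "32:64"]
    elif strategy == "dropout":
        args += ["--use-dropout"]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return args + [query, out_dir]

for strategy in ("msa_depth", "dropout"):
    # Hypothetical input and output paths, for illustration only.
    cmd = sampling_command(strategy, "query/ASCT2.fasta", f"result/{strategy}")
    print(shlex.join(cmd))

print("structures per strategy:", NUM_MODELS * NUM_SEEDS)
```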
This part applies principal component analysis (PCA), using CPPTRAJ, to the alpha-carbon positions of each predicted model. The aim is to reduce the dimensionality from 451 parameters (the number of processed residues) to only a few that capture most of the conformational movement. Multiple conformational states can then be identified from the PCA result by selecting representative structures that lie furthest apart along the PC1 and/or PC2 axis (depending on the amount of variance captured by each PC).
The provided CPPTRAJ script (run with this bash script), which is largely based on the script provided by del Alamo et al., performs the following steps:
- Trim terminal stretches with low pLDDT scores to reduce noise in the PCA.
- Compute the average position of the remaining 451 alpha carbons across the 80 protein models and subtract it from each of the models.
- Compute the covariance matrix of the 451 updated positions.
- Compute the eigenvalues and eigenvectors of the covariance matrix, sort them, and rearrange the matrix accordingly.
- Compute the projections of the 80 models along the first three principal components (PCs).
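The centering, covariance, eigen-decomposition, and projection steps above can be sketched in plain Python on a toy data set. This is a simplified illustration, not the CPPTRAJ implementation: it assumes the models are already superimposed and trimmed, it flattens each model into a list of x, y, z coordinates, and it finds only the first principal component by power iteration instead of a full diagonalization. All function names are invented for this example.

```python
import math

def center(models):
    """Subtract the per-coordinate mean across models from every model."""
    n_models = len(models)
    mean = [sum(column) / n_models for column in zip(*models)]
    return [[x - m for x, m in zip(model, mean)] for model in models]

def covariance(centered):
    """Sample covariance matrix of the centered coordinates."""
    n_models, dim = len(centered), len(centered[0])
    return [
        [sum(m[i] * m[j] for m in centered) / (n_models - 1) for j in range(dim)]
        for i in range(dim)
    ]

def first_component(cov, iterations=200):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    dim = len(cov)
    vec = [1.0 / (i + 1) for i in range(dim)]  # fixed, non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * cov[i][j] * vec[j] for i in range(dim) for j in range(dim))
    return value, vec

def project(centered, vec):
    """Projection of each centered model onto one component."""
    return [sum(x * v for x, v in zip(model, vec)) for model in centered]

# Toy input: 4 models, 2 alpha carbons each (x, y, z flattened); one coordinate varies.
models = [[0.0, 0.0, 0.0, t, 0.0, 0.0] for t in (1.0, 2.0, 3.0, 4.0)]
centered = center(models)
cov = covariance(centered)
eigenvalue, pc1 = first_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
projections = project(centered, pc1)
print(explained, projections)
```

In the toy data all motion lies along a single coordinate, so the first component captures the entire variance; with real ensembles the explained fraction indicates whether PC1 alone or PC1 and PC2 together are needed to separate the states.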
The results of the PCA can be found in batch/result/conformation/pca. Some key outputs are:
- project.dat: the first three PCs for each model (you can ignore the last column (PC3) in this case)
- eigenvec.dat: the amount of variance captured by each PC in descending order (2nd column)
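Picking representative structures from the projections can be sketched as follows. The file layout assumed here (comment lines starting with #, then a frame index followed by one projection per PC, separated by whitespace) is a guess at the shape of project.dat and should be checked against the actual file; the function names and the toy values are invented for this example.

```python
def read_projections(text):
    """Parse whitespace-separated rows: frame index followed by PC projections."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        rows.append((int(fields[0]), [float(x) for x in fields[1:]]))
    return rows

def extremes(rows, pc=0):
    """Return the frame indices with the lowest and highest projection on one PC."""
    lowest = min(rows, key=lambda row: row[1][pc])
    highest = max(rows, key=lambda row: row[1][pc])
    return lowest[0], highest[0]

# Toy content in the assumed layout, not real protocol output.
toy = """#Frame Mode1 Mode2 Mode3
1 -12.0 0.5 0.1
2 3.0 -4.0 0.2
3 15.5 1.0 -0.3
"""
rows = read_projections(toy)
print(extremes(rows, pc=0))
print(extremes(rows, pc=1))
```

The two frames returned for a given PC are the models furthest apart along that axis, which makes them natural candidates for the distinct conformational states.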
- Kim G, Lee S, Levy Karin E, Kim H, Moriwaki Y, Ovchinnikov S, Steinegger M and Mirdita M. Easy and accurate protein structure prediction using ColabFold.
  Protocol Exchange (2023) doi: 10.21203/rs.3.pex-2490/v1
- Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S and Steinegger M. ColabFold: Making protein folding accessible to all.
  Nature Methods (2022) doi: 10.1038/s41592-022-01488-1
- If you’re using AlphaFold, please also cite:
  Jumper et al. "Highly accurate protein structure prediction with AlphaFold."
  Nature (2021) doi: 10.1038/s41586-021-03819-2
- If you’re using AlphaFold-multimer, please also cite:
  Evans et al. "Protein complex prediction with AlphaFold-Multimer."
  bioRxiv (2021) doi: 10.1101/2021.10.04.463034
- If you’re using RoseTTAFold, please also cite:
  Baek et al. "Accurate prediction of protein structures and interactions using a three-track neural network."
  Science (2021) doi: 10.1126/science.abj8754