<a href="https://colab.research.google.com/github/tracy-zeng/Mito-TPCA/blob/main/batch/AlphaFold2_batch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ColabFold v1.5.5: AlphaFold2 w/ MMseqs2 BATCH

<img src="https://raw.githubusercontent.com/sokrypton/ColabFold/main/.github/ColabFold_Marv_Logo_Small.png" height="256" align="right" style="height:256px">

Easy to use AlphaFold2 protein structure [(Jumper et al. 2021)](https://www.nature.com/articles/s41586-021-03819-2) and complex [(Evans et al. 2021)](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1) prediction using multiple sequence alignments generated through MMseqs2. For details, refer to our manuscript:

[Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all.
*Nature Methods*, 2022](https://www.nature.com/articles/s41592-022-01488-1)

**Usage**

`input_dir`: directory containing only FASTA files or MSAs, stored in Google Drive. MSAs must be in A3M format and have an `.a3m` extension. MMseqs2 is not called for inputs that are already MSAs.

`result_dir`: directory in Google Drive where the results are written.
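
Because `input_dir` should hold only FASTA or A3M files, a quick scan before a batch run can catch stray files. The following is a minimal sketch: `find_unexpected_files` is a hypothetical helper, and the accepted extensions (`.fasta`, `.fa`, `.a3m`) are an assumption rather than ColabFold's exact list.

```python
from pathlib import Path

# Assumed set of accepted extensions; adjust to match your inputs.
ACCEPTED_SUFFIXES = {".fasta", ".fa", ".a3m"}


def find_unexpected_files(input_dir):
    """Return sorted names of entries in input_dir that are not FASTA/A3M files."""
    return sorted(
        p.name
        for p in Path(input_dir).iterdir()
        if not (p.is_file() and p.suffix.lower() in ACCEPTED_SUFFIXES)
    )
```

An empty list from `find_unexpected_files(input_dir)` means the directory contains only files with the assumed extensions; subdirectories are reported as unexpected too.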

Old versions: [v1.4](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.4.0/batch/AlphaFold2_batch.ipynb), [v1.5.1](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.1/batch/AlphaFold2_batch.ipynb), [v1.5.2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.2/batch/AlphaFold2_batch.ipynb), [v1.5.3-patch](https://colab.research.google.com/github/sokrypton/ColabFold/blob/56c72044c7d51a311ca99b953a71e552fdc042e1/batch/AlphaFold2_batch.ipynb)

<strong>For more details, see the <a href="#Instructions">bottom</a> of the notebook and check out the [ColabFold GitHub](https://github.com/sokrypton/ColabFold). </strong>

-----------

### News
- <b><font color='green'>2023/07/31: The ColabFold MSA server is back to normal. It was using older DB (UniRef30 2202/PDB70 220313) from 27th ~8:30 AM CEST to 31st ~11:10 AM CEST.</font></b>
- <b><font color='green'>2023/06/12: New databases! UniRef30 updated to 2023_02 and PDB to 230517. We now use PDB100 instead of PDB70 (see notes in the [main](https://colabfold.com) notebook).</font></b>
- <b><font color='green'>2023/06/12: We introduced a new default pairing strategy: Previously, for multimer predictions with more than 2 chains, we only pair if all sequences taxonomically match ("complete" pairing). The new default "greedy" strategy pairs any taxonomically matching subsets.</font></b>
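
The difference between the two pairing strategies can be illustrated with a toy model. This is only a sketch of the description above, not the code used by ColabFold or the MSA server: `paired_taxa` is a hypothetical function, each chain is reduced to the set of taxa found in its MSA, and "any taxonomically matching subset" is interpreted as "a taxon present in at least two chains".

```python
def paired_taxa(chain_taxa, strategy="greedy"):
    """Toy model of MSA pairing.

    chain_taxa: list with one set of taxon labels per chain.
    "complete": keep taxa found in every chain.
    "greedy":   keep taxa found in at least two chains (assumed reading).
    """
    all_taxa = set().union(*chain_taxa)
    counts = {t: sum(t in taxa for taxa in chain_taxa) for t in all_taxa}
    needed = len(chain_taxa) if strategy == "complete" else 2
    return {t for t, n in counts.items() if n >= needed}


chains = [{"human", "mouse", "yeast"}, {"human", "mouse"}, {"human", "fly"}]
print(sorted(paired_taxa(chains, "complete")))  # ['human']
print(sorted(paired_taxa(chains, "greedy")))    # ['human', 'mouse']
```

In this toy case the greedy strategy keeps the mouse sequences for the two chains that have them, while complete pairing discards them because the third chain has no mouse hit.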

In [70]:
#@title Mount google drive
# from google.colab import drive
# drive.mount('/content/drive')
from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"

In [71]:
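import os
import shutil

input_dir = "/content/input_fasta"
shutil.rmtree(input_dir, ignore_errors=True)  # remove any previous input folder; no error if it is absent
os.makedirs(input_dir, exist_ok=True)

result_dir = '/content/result'
shutil.rmtree(result_dir, ignore_errors=True)  # remove any previous results folder; no error if it is absent
os.makedirs(result_dir, exist_ok=True)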

In [72]:
from google.colab import files
import os

# Upload local FASTA files (several can be uploaded at once)
uploaded = files.upload()

# Create the input folder
input_dir = '/content/input_fasta'
os.makedirs(input_dir, exist_ok=True)

# Save the uploaded files to input_dir
for fname in uploaded.keys():
    with open(os.path.join(input_dir, fname), 'wb') as f:
        f.write(uploaded[fname])

print(f"Uploaded files: {list(uploaded.keys())}")

Saving test.fasta to test (10).fasta
Uploaded files: ['test (10).fasta']


In [73]:
import re

for f in os.listdir(input_dir):
    new_name = re.sub(r"\s*\(\d+\)", "", f)  # strip the " (N)" suffix added to re-uploaded files
    old_path = os.path.join(input_dir, f)
    new_path = os.path.join(input_dir, new_name)
    if old_path != new_path:
        os.rename(old_path, new_path)
        print(f"✅ Renamed '{f}' -> '{new_name}'")

✅ Renamed 'test (10).fasta' -> 'test.fasta'
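
The rename step can also be written as a pure function that is easy to test and that refuses to overwrite an existing file (on Linux, `os.rename` silently replaces a destination file that already exists). This is a sketch under that assumption; `strip_duplicate_suffix` and `plan_renames` are hypothetical helper names, not part of the notebook.

```python
import re

# Same pattern as the cell above: optional whitespace followed by "(digits)".
DUPLICATE_SUFFIX = re.compile(r"\s*\(\d+\)")


def strip_duplicate_suffix(filename):
    """Remove every ' (N)' marker from a file name."""
    return DUPLICATE_SUFFIX.sub("", filename)


def plan_renames(filenames):
    """Map old name -> new name, skipping renames that would collide."""
    taken = set(filenames)
    plan = {}
    for name in filenames:
        new_name = strip_duplicate_suffix(name)
        if new_name != name and new_name not in taken:
            plan[name] = new_name
            taken.add(new_name)
    return plan
```

Calling `plan_renames(os.listdir(input_dir))` yields the renames that are safe to apply; a name such as `test (1).fasta` is left alone when `test.fasta` is already present.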


In [74]:
print(os.listdir(input_dir))

['test.fasta']


In [75]:
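# Split a multi-sequence FASTA file into one file per sequence
from pathlib import Path

fasta_file = list(Path(input_dir).glob("*.fasta"))[0]
out_dir = Path(input_dir)
with open(fasta_file) as f:
    content = f.read().split(">")[1:]  # split into records on ">"

for block in content:
    lines = block.strip().split("\n")
    header = lines[0].strip()
    seq = "".join(line.strip() for line in lines[1:])
    # Extract the UniProt accession, e.g. from "sp|A0AV02|...";
    # fall back to the first word of the header if it has no "|" fields
    parts = header.split("|")
    uid = parts[1] if len(parts) > 1 else header.split()[0]
    out_path = out_dir / f"{uid}.fasta"
    with open(out_path, "w") as out:
        out.write(f">{header}\n{seq}\n")
    print(f"✅ Wrote {out_path}")

# Delete the original multi-sequence FASTA so it is not predicted a second time
Path(fasta_file).unlink()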

✅ Wrote /content/input_fasta/A0A087X1C5.fasta
✅ Wrote /content/input_fasta/A0A0B4J2F0.fasta
✅ Wrote /content/input_fasta/A0A0C5B5G6.fasta
✅ Wrote /content/input_fasta/A0A0K2S4Q6.fasta
✅ Wrote /content/input_fasta/A0A0U1RRE5.fasta
✅ Wrote /content/input_fasta/A0A1B0GTW7.fasta
✅ Wrote /content/input_fasta/A0AV02.fasta
✅ Wrote /content/input_fasta/A0AV96.fasta


In [76]:
#@title Input protein sequence, then hit `Runtime` -> `Run all`

input_dir = '/content/input_fasta' #@param {type:"string"}
result_dir = '/content/result' #@param {type:"string"}
os.makedirs(result_dir, exist_ok=True)

# number of models to use
#@markdown ---
#@markdown ### Advanced settings
msa_mode = "single_sequence" #@param ["MMseqs2 (UniRef+Environmental)", "MMseqs2 (UniRef only)","single_sequence","custom"]
num_models = 5 #@param [1,2,3,4,5] {type:"raw"}
num_recycles = 3 #@param [1,3,6,12,24,48] {type:"raw"}
stop_at_score = 100 #@param {type:"string"}
#@markdown - early stop computing models once score > threshold (avg. plddt for "structures" and ptmscore for "complexes")
use_custom_msa = False
num_relax = 0 #@param [0, 1, 5] {type:"raw"}
use_amber = num_relax > 0
relax_max_iterations = 200 #@param [0,200,2000] {type:"raw"}
use_templates = False #@param {type:"boolean"}
do_not_overwrite_results = True #@param {type:"boolean"}
zip_results = False #@param {type:"boolean"}
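
The early-stop rule described for `stop_at_score` (stop once a model's score exceeds the threshold) can be sketched as below. This illustrates the described behaviour only and is not ColabFold's implementation; `models_evaluated` is a hypothetical function and the scores are made-up example values.

```python
def models_evaluated(model_scores, stop_at_score):
    """Return the scores of the models that would be computed before stopping.

    model_scores: per-model scores in evaluation order
                  (average pLDDT for structures, pTM score for complexes).
    """
    threshold = float(stop_at_score)
    evaluated = []
    for score in model_scores:
        evaluated.append(score)
        if score > threshold:  # strict ">" as in the setting's description
            break
    return evaluated


print(models_evaluated([70.0, 92.0, 88.0], 90))   # [70.0, 92.0]
print(models_evaluated([70.0, 92.0, 88.0], 100))  # [70.0, 92.0, 88.0]
```

With the value of 100 used above, an average pLDDT can never exceed the threshold, so every requested model is computed.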


In [79]:
!pip install appdirs
!pip install pdbfixer openmm

Collecting appdirs
  Using cached appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Using cached appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Installing collected packages: appdirs
Successfully installed appdirs-1.4.4




In [78]:
#@title Install dependencies
%%bash -s $use_amber $use_templates $python_version

set -e

USE_AMBER=$1
USE_TEMPLATES=$2
PYTHON_VERSION=$3

if [ ! -f COLABFOLD_READY ]; then
  # install dependencies
  # We have to use "--no-warn-conflicts" because colab already has a lot preinstalled with requirements different to ours
  pip install -q --no-warn-conflicts "colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold"
  if [ -n "${TPU_NAME}" ]; then
    pip install -q --no-warn-conflicts -U dm-haiku==0.0.10 jax==0.3.25
  fi
  ln -s /usr/local/lib/python3.*/dist-packages/colabfold colabfold
  ln -s /usr/local/lib/python3.*/dist-packages/alphafold alphafold
  # hack to fix TF crash
  rm -f /usr/local/lib/python3.*/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so
  touch COLABFOLD_READY
fi

# Download params (~1min)
python -m colabfold.download

# setup conda
if [ ${USE_AMBER} == "True" ] || [ ${USE_TEMPLATES} == "True" ]; then
  if [ ! -f CONDA_READY ]; then
    wget -qnc https://github.com/conda-forge/miniforge/releases/download/25.3.1-0/Miniforge3-25.3.1-0-Linux-x86_64.sh
    bash Miniforge3-25.3.1-0-Linux-x86_64.sh -bfp /usr/local 2>&1 1>/dev/null
    rm Miniforge3-25.3.1-0-Linux-x86_64.sh
    conda config --set auto_update_conda false
    touch CONDA_READY
  fi
fi
# setup template search
if [ ${USE_TEMPLATES} == "True" ] && [ ! -f HH_READY ]; then
  conda install -y -q -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 python="${PYTHON_VERSION}" 2>&1 1>/dev/null
  touch HH_READY
fi
# setup openmm for amber refinement
if [ ${USE_AMBER} == "True" ] && [ ! -f AMBER_READY ]; then
  conda install -y -q -c conda-forge openmm=8.2.0 python="${PYTHON_VERSION}" pdbfixer 2>&1 1>/dev/null
  touch AMBER_READY
fi

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/content/colabfold/download.py", line 6, in <module>
    import appdirs
ModuleNotFoundError: No module named 'appdirs'


CalledProcessError: Command 'b'\nset -e\n\nUSE_AMBER=$1\nUSE_TEMPLATES=$2\nPYTHON_VERSION=$3\n [... remainder of the cell script shown above ...] '' returned non-zero exit status 1.
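
The traceback above shows `colabfold.download` failing on `import appdirs`. A small check run before the download or prediction step can report such gaps up front. This is a sketch: `missing_modules` is a hypothetical helper, and the module list simply collects the names that appear in this notebook's errors and install commands.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Modules named in the errors and install commands of this notebook.
missing = missing_modules(["appdirs", "pdbfixer", "openmm"])
if missing:
    print("Missing modules:", ", ".join(missing))
```

Any name printed here needs to be installed (for example with the `pip install` cell above) before the dependent cell is rerun.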

In [60]:
# ================== Run Prediction ==================
import sys, os, re, shutil
from pathlib import Path
from colabfold.batch import get_queries, run
from colabfold.download import default_data_dir
from colabfold.utils import setup_logging

!pip install pdbfixer
!pip install openmm

# Make sure pdbfixer can be imported
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

setup_logging(Path(result_dir).joinpath("log.txt"))

# Load the input sequences
queries, is_complex = get_queries(input_dir)

# Run the prediction and keep only the top-ranked model
run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    num_relax=num_relax,
    relax_max_iterations=relax_max_iterations,
    msa_mode=msa_mode,
    model_type="auto",
    num_models=1,                 # key setting: run a single model only
    num_recycles=num_recycles,
    model_order=[1],
    is_complex=is_complex,
    data_dir=default_data_dir,
    keep_existing_results=False,
    rank_by="auto",
    pair_mode="unpaired",
    stop_at_score=stop_at_score,
    zip_results=False,
    user_agent="colabfold/google-colab-batch",
)

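# ================== Rename Results ==================
# The split step wrote one FASTA per sequence, named after its UniProt
# accession, so each file stem doubles as the ColabFold job name.
seq_ids = sorted(p.stem for p in Path(input_dir).glob("*.fasta"))

# Copy the top-ranked model of each sequence to <accession>.pdb
for sid in seq_ids:
    # Assumption: ColabFold 1.5.x writes "<job>_unrelaxed_rank_001_*.pdb"
    # (and "<job>_relaxed_rank_001_*.pdb" when relaxation is on) directly
    # into result_dir, rather than AlphaFold's "<job>/ranked_0.pdb".
    candidates = sorted(Path(result_dir).glob(f"{sid}_*rank_001*.pdb"))
    relaxed = [p for p in candidates if "_relaxed_" in p.name]
    if candidates:
        src_pdb = (relaxed or candidates)[0]  # prefer the relaxed model if present
        dst_pdb = Path(result_dir) / f"{sid}.pdb"
        shutil.copy(src_pdb, dst_pdb)
        print(f"✅ Saved {dst_pdb}")
    else:
        print(f"⚠️ No rank_001 PDB found for {sid}")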

2025-09-09 08:24:50,493 Running on GPU
2025-09-09 08:24:50,496 Found 3 citations for tools or databases
2025-09-09 08:24:50,496 Query 1/8: A0A0C5B5G6 (length 16)
2025-09-09 08:24:58,197 Padding length to 26
2025-09-09 08:25:15,440 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=64.8 pTM=0.0259
2025-09-09 08:25:32,903 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=65.5 pTM=0.0262 tol=1.06
2025-09-09 08:25:33,088 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=63.8 pTM=0.0265 tol=0.924
2025-09-09 08:25:33,267 alphafold2_ptm_model_1_seed_000 recycle=3 pLDDT=65.9 pTM=0.027 tol=1.03
2025-09-09 08:25:33,267 alphafold2_ptm_model_1_seed_000 took 35.1s (3 recycles)
2025-09-09 08:25:33,270 reranking models by 'plddt' metric


ModuleNotFoundError: No module named 'pdbfixer'
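
The per-recycle lines in the log above follow a regular pattern, so the confidence values can be pulled out of `log.txt` for later inspection. The sketch below assumes the line layout shown in this output (`<model> recycle=N pLDDT=X pTM=Y`); `parse_recycles` is a hypothetical helper, and other ColabFold versions may format the log differently.

```python
import re

# Pattern inferred from the log lines shown above (an assumption, not a spec).
RECYCLE_LINE = re.compile(
    r"(?P<model>\S+) recycle=(?P<recycle>\d+) pLDDT=(?P<plddt>[\d.]+) pTM=(?P<ptm>[\d.]+)"
)


def parse_recycles(log_text):
    """Return (model, recycle, pLDDT, pTM) tuples for every recycle line in the log."""
    rows = []
    for line in log_text.splitlines():
        match = RECYCLE_LINE.search(line)
        if match:
            rows.append(
                (
                    match["model"],
                    int(match["recycle"]),
                    float(match["plddt"]),
                    float(match["ptm"]),
                )
            )
    return rows
```

For example, `parse_recycles(Path(result_dir, "log.txt").read_text())` (with `Path` imported from `pathlib`) returns one tuple per recycle, which can then be filtered or plotted.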

In [30]:
import os

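# "filenames" (not "files") avoids shadowing the google.colab `files` module
for root, dirs, filenames in os.walk(result_dir):
    for f in filenames:
        if not f.endswith(".pdb"):  # delete everything that is not a PDB file
            file_path = os.path.join(root, f)
            try:
                os.remove(file_path)
            except OSError:
                pass  # best effort: skip files that cannot be removed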

In [29]:
import shutil

# Zip the whole result folder
shutil.make_archive("results", "zip", result_dir)

# Download the archive to the local machine
from google.colab import files
files.download("results.zip")


In [None]:
import zipfile
import glob
import os
from google.colab import files

zip_filename = "pdb_results.zip"

# Recursively find every PDB file in all subfolders
pdb_files = glob.glob(f"{result_dir}/**/*.pdb", recursive=True)
print(f"Found {len(pdb_files)} PDB files")

# Create the zip, keeping only the file names (no directory paths)
with zipfile.ZipFile(zip_filename, 'w') as zipf:
    for pdb in pdb_files:
        zipf.write(pdb, arcname=os.path.basename(pdb))

# Download the zip
files.download(zip_filename)
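
Because the archive above keeps only base names, two PDB files with the same name in different subfolders would end up under the same archive entry name. A quick check before zipping can flag this. It is only a sketch, and `basename_collisions` is a hypothetical helper.

```python
import os
from collections import defaultdict


def basename_collisions(paths):
    """Group paths by base name and return only the names shared by several paths."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.basename(path)].append(path)
    return {name: members for name, members in groups.items() if len(members) > 1}
```

If `basename_collisions(pdb_files)` returns a non-empty dictionary, the affected files could be given distinct `arcname` values, for instance by prefixing the parent folder name.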

In [None]:
#@title Run Prediction

import sys

from colabfold.batch import get_queries, run
from colabfold.download import default_data_dir
from colabfold.utils import setup_logging
from pathlib import Path

# For some reason we need that to get pdbfixer to import
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

setup_logging(Path(result_dir).joinpath("log.txt"))

queries, is_complex = get_queries(input_dir)
run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    num_relax=num_relax,
    relax_max_iterations=relax_max_iterations,
    msa_mode=msa_mode,
    model_type="auto",
    num_models=num_models,
    num_recycles=num_recycles,
    model_order=[1, 2, 3, 4, 5],
    is_complex=is_complex,
    data_dir=default_data_dir,
    keep_existing_results=do_not_overwrite_results,
    rank_by="auto",
    pair_mode="unpaired+paired",
    stop_at_score=stop_at_score,
    zip_results=zip_results,
    user_agent="colabfold/google-colab-batch",
)

# Instructions <a name="Instructions"></a>
**Quick start**
1. Upload your single fasta files to a folder in your Google Drive
2. Define the path to the folder containing the fasta files (`input_dir`) and an output directory (`result_dir`).
3. Press "Runtime" -> "Run all".

**Result zip file contents**

When `zip_results` is enabled, the results of each job are packed into `jobname.result.zip` in your `result_dir` on Google Drive at the end of the job. Each zip contains one protein.

1. PDB-formatted structures sorted by average pLDDT (unrelaxed, plus relaxed if `use_amber` is enabled).
2. Plots of the model quality.
3. Plots of the MSA coverage.
4. Parameter log file.
5. A3M formatted input MSA.
6. BibTeX file with citations for all used tools and databases.


**Troubleshooting**
* Check that the runtime type is set to GPU at "Runtime" -> "Change runtime type".
* Try to restart the session "Runtime" -> "Factory reset runtime".
* Check your input sequence.

**Known issues**
* Google Colab assigns different types of GPUs with varying amount of memory. Some might not have enough memory to predict the structure for a long sequence.
* Your browser can block the pop-up for downloading the result file. You can choose the `save_to_google_drive` option to upload to Google Drive instead or manually download the result file: Click on the little folder icon to the left, navigate to file: `jobname.result.zip`, right-click and select "Download" (see [screenshot](https://pbs.twimg.com/media/E6wRW2lWUAEOuoe?format=jpg&name=small)).

**Limitations**
* Computing resources: Our MMseqs2 API can handle ~20-50k requests per day.
* MSAs: MMseqs2 is very precise and sensitive but might find fewer hits than HHblits/HMMer searches against BFD or MGnify.
* We recommend additionally using the full [AlphaFold2 pipeline](https://github.com/deepmind/alphafold).

**Description of the plots**
*   **Number of sequences per position** - We want to see at least 30 sequences per position and, for best performance, ideally 100.
*   **Predicted lDDT per position** - model confidence (out of 100) at each position. The higher the better.
*   **Predicted Alignment Error** - For homooligomers, this could be a useful metric to assess how confident the model is about the interface. The lower the better.
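
The "number of sequences per position" can also be computed directly from an A3M file. The sketch below assumes the usual A3M conventions (lowercase letters are insertions relative to the query, `-` marks a gap, lines starting with `#` are comments); `a3m_coverage` is a hypothetical helper and not the function ColabFold uses for its plot.

```python
def a3m_coverage(a3m_text):
    """Count, for each query position, how many sequences have a residue there."""
    sequences, current = [], []
    for line in a3m_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
            current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))

    # Drop lowercase insertions so that every sequence aligns to the query columns.
    aligned = ["".join(c for c in seq if not c.islower()) for seq in sequences]
    if not aligned:
        return []
    length = len(aligned[0])  # the first sequence is taken to be the query
    return [
        sum(1 for seq in aligned if i < len(seq) and seq[i] != "-")
        for i in range(length)
    ]


example = ">query\nMKT\n>hit1\nM-T\n>hit2\nMaKT\n>hit3\n--T\n"
print(a3m_coverage(example))  # [3, 2, 4]
```

Positions whose count falls well below the 30 sequences mentioned above are the ones where the prediction has little evolutionary information to draw on.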

**Bugs**
- If you encounter any bugs, please report the issue to https://github.com/sokrypton/ColabFold/issues

**License**

The source code of ColabFold is licensed under [MIT](https://raw.githubusercontent.com/sokrypton/ColabFold/main/LICENSE). Additionally, this notebook uses AlphaFold2 source code and its parameters licensed under [Apache 2.0](https://raw.githubusercontent.com/deepmind/alphafold/main/LICENSE) and [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) respectively. Read more about the AlphaFold license [here](https://github.com/deepmind/alphafold).

**Acknowledgments**
- We thank the AlphaFold team for developing an excellent model and open sourcing the software.

- Do-Yoon Kim for creating the ColabFold logo.

- A colab by Sergey Ovchinnikov ([@sokrypton](https://twitter.com/sokrypton)), Milot Mirdita ([@milot_mirdita](https://twitter.com/milot_mirdita)) and Martin Steinegger ([@thesteinegger](https://twitter.com/thesteinegger)).
