# MMseqs2 and ColabFold: sequence to 3D protein structure prediction using AlphaFold

## 0) Quickstart
See the quickstart notebook for a short example of how to go from a sequence to a 3D protein conformer.

# 1) Setup

## 1.0) Imports

In [1]:
import os
from pathlib import Path
import rush
import json
import requests
import py3Dmol

## 1.1) Configuration

In [2]:
# Define our project information
DESCRIPTION = "colabfold notebook"
TAGS = ["qdx", "rush-py-v2", "demo", "colabfold"]
WORK_DIR = Path.home() / "qdx" / "colabfold"

In [3]:
# |hide
if WORK_DIR.exists():
    client = rush.Provider(workspace=WORK_DIR)
    await client.nuke(remote=False)
os.makedirs(WORK_DIR, exist_ok=True)
os.chdir(WORK_DIR)
YOUR_TOKEN = os.getenv("RUSH_TOKEN")

## 1.2) Build your client
Initialize our rush client and fetch available modules.

In [4]:
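# Get our client, for calling modules and using the rush API.
# YOUR_TOKEN is read from the RUSH_TOKEN environment variable; fail early if it is missing.
if not YOUR_TOKEN:
    raise RuntimeError("Set the RUSH_TOKEN environment variable before running this notebook.")
os.environ["RUSH_TOKEN"] = YOUR_TOKEN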

client = rush.build_blocking_provider_with_functions(
    workspace=WORK_DIR,
    batch_tags=TAGS,
    access_token=YOUR_TOKEN
)

## 2.0) Get a sample FASTA sequence
Here, we have a FASTA sequence already, but you can source your FASTA sequences from anywhere, and save them in your workspace directory.

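The sequence needs some pre-processing first: we remove the line breaks so that it becomes a single string of residues. If your FASTA file includes a header (comment) line starting with `>`, strip that as well.

We then load it into a string so we can pass it to `mmseqs2`, which produces an MSA (multiple sequence alignment) from the initial amino acid sequence.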
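
If your FASTA file does carry a header line, the pre-processing described above could look like the minimal sketch below. The helper name `fasta_to_sequence` and the example record are hypothetical (they are not part of rush), and the sketch assumes a single-record FASTA.

```python
def fasta_to_sequence(fasta_text: str) -> str:
    """Return the residues of a single-record FASTA string as one line.

    Hypothetical helper, not part of rush: header lines (starting with '>')
    and blank lines are dropped, and the remaining lines are joined.
    """
    lines = (line.strip() for line in fasta_text.splitlines())
    return "".join(line for line in lines if line and not line.startswith(">"))


# Made-up example record, used only to illustrate the helper.
example = """>example_record made-up header
MALGELKDDD
FEKISELGAG
"""
sequence = fasta_to_sequence(example)
print(sequence)
```

The result is a bare residue string, which is the form passed to `mmseqs2` later in this notebook.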

In [5]:
FASTA_SEQUENCE = """
MALGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIHLEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGSLDQVLKKAGRIPEQILGKVSIAVIKGLTYLR
EKHKIMHRDVKPSNILVNSRGEIKLCDFGVSGQLIDEMANEFVGTRSYMSPERLQGTHYSVQSDIWSMGLSLVEMAVGRYPRPPMAIFELLDYIVNEPPPKLPSAVFSLEFQDFVNKCLIKNPAE
RADLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAAGEGHHHHHH"""
FASTA_SEQUENCE = FASTA_SEQUENCE.replace("\n", "")

with open('mek1.fasta', 'w') as file:
    # Use print function to write text to the file
    print(FASTA_SEQUENCE, file=file)

In [6]:
def load_file_to_string(file_path):
    """Return the contents of a text file, or None if it cannot be read."""
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError:
        print(f"The file {file_path} was not found.")
        return None
    except Exception as e:
        print(f"An error occurred while reading the file: {e}")
        return None

In [7]:
fasta_sequence = load_file_to_string("mek1.fasta").strip()
fasta_sequence

'MALGELKDDDFEKISELGAGNGGVVFKVSHKPSGLVMARKLIHLEIKPAIRNQIIRELQVLHECNSPYIVGFYGAFYSDGEISICMEHMDGGSLDQVLKKAGRIPEQILGKVSIAVIKGLTYLREKHKIMHRDVKPSNILVNSRGEIKLCDFGVSGQLIDEMANEFVGTRSYMSPERLQGTHYSVQSDIWSMGLSLVEMAVGRYPRPPMAIFELLDYIVNEPPPKLPSAVFSLEFQDFVNKCLIKNPAERADLKQLMVHAFIKRSDAEEVDFAGWLCSTIGLNQPSTPTHAAGEGHHHHHH'

## 2.1) Run mmseqs2 on your FASTA sequence to get a sequence alignment
All outputs of Rush modules are tuples, so we destructure the tuple to get the MSA.
Note that `mmseqs2` requires significant computational resources and access to a storage mount on the GADI supercomputer that contains the sequence database.
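
For readers unfamiliar with the `(msa,) = ...` pattern, here is a minimal sketch of single-element tuple destructuring in plain Python. The function `fake_module` is a hypothetical stand-in for a module call and does not contact any service.

```python
def fake_module():
    """Hypothetical stand-in for a module call that returns a one-element tuple."""
    return ("msa-handle",)


result = fake_module()          # the whole tuple: ("msa-handle",)
(msa_handle,) = fake_module()   # the single element inside the tuple

print(type(result).__name__, msa_handle)
```

The trailing comma inside the parentheses is what makes the left-hand side a tuple pattern. If the call returned a different number of elements, Python would raise a `ValueError` at this line, which doubles as a cheap check on the number of outputs.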

In [None]:
(msa,) = client.mmseqs2(
    {"fasta": [fasta_sequence]},
    resources={"gpus": 0, "cpus": 48, "mem": 128 * 1024,
               "storage_mounts": ["gdata/if89"], "walltime": 360},
    target="GADI",
)

## 2.2) Run AlphaFold2 to get our predicted conformer
Note that we can pass the MSA produced by `mmseqs2` directly to the fold module.

In [None]:
(fold_output,) = client.colabfold_fold(msa, resources={"gpus": 1, "storage_mounts": ["gdata/if89"]}, target="GADI")

In [14]:
folded_conformers_url = fold_output.get()
folded_conformers = requests.get(folded_conformers_url).json()
# Spot check the element symbols of the first 10 atoms in the first folded conformer
folded_conformers[0]['topology']['symbols'][0:10]

['N', 'C', 'C', 'C', 'O', 'C', 'S', 'C', 'N', 'C']
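
Beyond looking at the first few symbols, a tally of elements gives a quick plausibility check on the whole conformer. The sketch below assumes only the nesting used in the spot check (a list of conformers, each with `topology` then `symbols`); `sample_conformers` is made-up stand-in data and `element_counts` is a hypothetical helper.

```python
from collections import Counter

# Made-up stand-in with the same nesting as the spot check above.
sample_conformers = [
    {"topology": {"symbols": ["N", "C", "C", "C", "O", "C", "S", "C", "N", "C"]}}
]


def element_counts(conformer: dict) -> Counter:
    """Count how often each element symbol occurs in one conformer."""
    return Counter(conformer["topology"]["symbols"])


counts = element_counts(sample_conformers[0])
print(counts.most_common())
```

On real output you would call `element_counts(folded_conformers[0])` instead of using the sample data.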


## 3.0) Visualise a predicted conformer
Here, we visualise one of the output conformers within the notebook so that we can review the quality of the prediction.
We first convert the predicted conformer to a PDB file, a format that py3Dmol can read.

In [None]:
with open('folded_conformer.json', 'w') as file:
    # Save the first conformer as JSON so it can be passed to the conversion module
    print(json.dumps(folded_conformers[0]), file=file)

(output_pdb,) = client.to_pdb(client.workspace / "folded_conformer.json")

output_pdb.download(filename="folded_conformer.pdb")

In [17]:
view = py3Dmol.view()
with open(client.workspace / "folded_conformer.pdb", "r") as f:
    view.addModel(f.read(), "pdb")
    view.setStyle({"cartoon": {"color": "spectrum"}})
    view.zoomTo()
    view.show()
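
Alongside the visual review, a simple numeric check is to confirm that the PDB file contains one residue per amino acid in the input sequence. The sketch below counts alpha-carbon (`CA`) `ATOM` records using the fixed-column PDB layout. It assumes the converted file uses standard `ATOM` records; `count_residues` is a hypothetical helper and `SAMPLE_PDB` is made-up data with arbitrary coordinates.

```python
def count_residues(pdb_text: str) -> int:
    """Count residues by collecting alpha-carbon (CA) ATOM records."""
    seen = set()
    for line in pdb_text.splitlines():
        # PDB is a fixed-column format: record name in columns 1-6,
        # atom name in 13-16, chain ID in 22, residue number plus
        # insertion code in 23-27 (1-based, inclusive).
        if line[:6] == "ATOM  " and line[12:16].strip() == "CA":
            seen.add((line[21], line[22:27]))
    return len(seen)


# Made-up two-residue fragment, used only to illustrate the helper.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N
ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C
ATOM      3  C   MET A   1      26.913  26.639   3.531  1.00  9.62           C
ATOM      4  N   ALA A   2      26.335  27.770   3.258  1.00  9.62           N
ATOM      5  CA  ALA A   2      26.850  29.021   3.898  1.00  9.62           C
TER
END
"""

print(count_residues(SAMPLE_PDB))
```

On the real file, you could compare `count_residues(open("folded_conformer.pdb").read())` with `len(fasta_sequence)`; a mismatch would suggest the conversion or the prediction dropped residues.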