## tengu-py

Below we'll walk through the process of building and running a drug discovery workflow using tengu!

First, install the following modules via pip (Python > 3.10 is required):
```
pip install tengu-py pdb-tools
```

In [None]:
import json
import os
import sys
import tarfile
import base64

from pdbtools import *
import requests
from datetime import datetime
from pathlib import Path

import tengu

### 0) Setup

In [None]:
# Set our token - ensure you have exported TENGU_TOKEN in your shell; or just replace the os.getenv with your token
TOKEN = os.getenv("TENGU_TOKEN")
# You might have a custom deployment url, by default it will use https://tengu.qdx.ai
URL = os.getenv("TENGU_URL") or "https://tengu.qdx.ai"
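
Because `os.getenv` returns `None` when a variable is unset, a missing token would only show up later as a failed API call. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper for illustration and is not part of tengu-py.

```python
import os


def require_env(name: str, environ=None) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty.

    Hypothetical helper for illustration; it is not part of tengu-py.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; export it in your shell before running")
    return value


# Usage with an explicit mapping, so the example runs without touching the real environment
print(require_env("TENGU_TOKEN", {"TENGU_TOKEN": "example-token"}))
```

In the notebook itself you could call `require_env("TENGU_TOKEN")` in place of the bare `os.getenv` lookup.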

In [None]:
# Define our project information
DESCRIPTION = "tengu-py demo notebook"
TAGS = ["qdx", "tengu-py", "demo", "cdk2", "atp", "caps"]
WORK_DIR = Path.home() / "qdx" / "tengu-py-demo"
OUT_DIR = WORK_DIR / "runs"
OUT_DIR.mkdir(parents=True, exist_ok=True)
MODULE_LOCK = WORK_DIR / "lock.json"

# Set our inputs
SYSTEM_PDB_PATH = WORK_DIR / "test.pdb"
PROTEIN_PDB_PATH = WORK_DIR / "test_P.pdb"
LIGAND_SMILES_STR = "c1nc(c2c(n1)n(cn2)[C@H]3[C@@H]([C@@H]([C@H](O3)CO[P@@](=O)(O)O[P@](=O)(O)OP(=O)(O)O)O)O)N"
LIGAND_PDB_PATH = WORK_DIR / "test_L.pdb"

In [None]:
# Fetch the data files: the full complex, the protein (chain A, no HETATM records),
# and the ligand (ATP, renamed to UNL)
complex_lines = list(pdb_fetch.fetch_structure("1B39"))
protein = pdb_delhetatm.remove_hetatm(pdb_selchain.select_chain(complex_lines, "A"))
ligand = pdb_keepcoord.keep_coordinates(
    pdb_rplresname.rename_residues(
        pdb_selresname.filter_residue_by_name(complex_lines, "ATP"), "ATP", "UNL"
    )
)
for path, lines in [
    (SYSTEM_PDB_PATH, complex_lines),
    (PROTEIN_PDB_PATH, protein),
    (LIGAND_PDB_PATH, ligand),
]:
    with open(path, "w") as f:
        f.writelines(str(line) for line in lines)
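
To give a feel for what the chain selection and HETATM removal do, here is a simplified standard-library sketch. It relies only on the fixed-column PDB layout (record name in columns 1-6, chain identifier in column 22). The helper names are hypothetical and this is not the pdb-tools implementation.

```python
def make_record(record, serial, name, res_name, chain_id, res_seq):
    """Build a minimal fixed-width PDB coordinate record (coordinates set to zero)."""
    return (
        f"{record:<6}{serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}"
        f"    {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}\n"
    )


def select_chain_without_hetatm(lines, chain_id):
    """Yield the lines of one chain, skipping HETATM records.

    Simplified stand-in for illustration only.
    """
    for line in lines:
        if line.startswith("HETATM"):
            continue  # drop ligands, waters, ions and other heteroatoms
        if line.startswith(("ATOM", "ANISOU", "TER")) and len(line) > 21:
            if line[21] != chain_id:
                continue  # coordinate record from a different chain
        yield line


example_lines = [
    make_record("ATOM", 1, "CA", "ALA", "A", 1),
    make_record("ATOM", 2, "CA", "ALA", "B", 1),
    make_record("HETATM", 3, "PG", "ATP", "A", 381),
]
kept = list(select_chain_without_hetatm(example_lines, "A"))
print(len(kept))
```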

In [None]:
# Get our client, for calling modules and using the tengu API
client = tengu.Provider(access_token=TOKEN, url=URL)

In [None]:
# Get our latest modules as a dict[module_name, module_path]
# If a lock file exists, load it so that the run is reproducible
if MODULE_LOCK.exists():
    modules = client.load_module_paths(MODULE_LOCK)
else:
    modules = client.get_latest_module_paths()
    client.save_module_paths(modules, MODULE_LOCK)

In [None]:
modules["hermes_energy"]

  - `module_name` is a descriptive string and indicates the "function" the module is calling;
  - `module_path` is a versioned tengu "endpoint" for a module accessible via the client.

Using the same `module_path` string across multiple runs provides reproducibility.

The following is an example of how to save and load frozen modules:
```python
frozen_modules_filepath = client.save_module_paths(modules)
frozen_modules = client.load_module_paths(frozen_modules_filepath)
assert modules == frozen_modules
```

You can also save the modules once and pass a fixed file path to `load_module_paths`:
```python
FROZEN_MODULES_FILEPATH = 'tengu-modules-20231006T132244.json'
frozen_modules = client.load_module_paths(FROZEN_MODULES_FILEPATH)
```
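
To make the idea concrete, here is a minimal sketch of the same load-or-create pattern using only the standard library. It assumes a lock file is a JSON object mapping module names to module paths, and that the timestamp in the file name above follows the `%Y%m%dT%H%M%S` pattern. The module path values are made up, and `lock_filename` and `load_or_create_lock` are hypothetical helpers rather than part of tengu-py.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def lock_filename(now: datetime) -> str:
    """Build a timestamped name such as 'tengu-modules-20231006T132244.json'."""
    return f"tengu-modules-{now.strftime('%Y%m%dT%H%M%S')}.json"


def load_or_create_lock(path: Path, fetch_latest) -> dict:
    """Reuse pinned module paths if the lock file exists; otherwise pin the latest ones."""
    if path.exists():
        return json.loads(path.read_text())
    modules = fetch_latest()
    path.write_text(json.dumps(modules, indent=2))
    return modules


with tempfile.TemporaryDirectory() as tmp:
    lock_path = Path(tmp) / lock_filename(datetime(2023, 10, 6, 13, 22, 44))
    # Placeholder module paths; real paths come from the tengu API
    first = load_or_create_lock(lock_path, lambda: {"hermes_energy": "example/path#v1"})
    # The second call ignores the "newer" version because the lock file already exists
    second = load_or_create_lock(lock_path, lambda: {"hermes_energy": "example/path#v2"})
    print(first == second)
```

Deleting the lock file is then the explicit way to opt in to newer module versions.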

Below we'll call modules using `client.run2()`.

The parameters to `client.run2()` are as follows:

  - `module_path`: The endpoint of the module we'll be running;
  - `args`: A list of the arguments to the module; an argument can be one of the following:
    1. A `pathlib.Path` or a file-like object like `BufferedReader`, `FileIO`, `StringIO` etc.:  
         Loads the data in the file as an argument.  
         **NOTE**: The uploaded value isn't just the string of the file,
         so don't pass the string directly; pass the path or wrap in StringIO.
    2. A tengu `ArgId` returned by a previous call to `client.run2()`:  
         The `ArgId` type wraps data for use within tengu. It may refer to an object already
         uploaded to tengu storage, such as outputs of other run calls.  
         See below for more details. It's easier to understand when you see an example.
    3. A parameter, i.e. a value of any other type, including `None`:  
         Tengu modules take configs as json in the backend; we'll convert for you.  
         Just pass arguments directly, as per the schema for the module you're running.
  - `target`: The machine we want to run on (`NIX_SSH` for a cluster, `GADI` for a supercomputer).
  - `resources`: The resources to use on the target.
  - `tags`: Tags to associate with our run, so we can easily look up our runs.
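
The three kinds of argument described above can be sketched as a small classifier. This only illustrates the rules as stated here, not how tengu-py dispatches internally. The `ArgId` class below is a stand-in defined for the example, and `classify_arg` is a hypothetical name.

```python
import io
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ArgId:
    """Stand-in for tengu's ArgId type, used only in this sketch."""

    value: str


def classify_arg(arg) -> str:
    """Return which of the three argument kinds a value falls into."""
    if isinstance(arg, (Path, io.IOBase)):
        return "file"  # the file's contents are uploaded
    if isinstance(arg, ArgId):
        return "reference"  # points at an object already in storage
    return "parameter"  # converted to JSON, including None


for example in [Path("test_P.pdb"), io.StringIO("ATOM ..."), ArgId("some-id"), {"nsteps": 1000}, None]:
    print(type(example).__name__, "->", classify_arg(example))
```

Note that a plain `str` falls into the "parameter" branch, which is why file contents should be passed as a path or wrapped in `StringIO` rather than passed as a raw string.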

The return value is a dict that contains:

  - key `"module_instance_id"` -> val is a `ModuleInstanceId` for the run itself;
  - key `"output_ids"` -> val is a list of `ArgId`s, one for each output.

Both of these ID types have the form of a UUID.
An `ArgId` lets you manipulate the output of a module without having to:

  1. Wait for the module to finish its computation, or
  2. Download the actual value corresponding to this output.

You can pass it to subsequent modules as if it were the value itself, or
you can wait on it to obtain the value itself.
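
As a rough illustration of this chaining, the sketch below uses a stand-in `fake_run` function (hypothetical, not part of tengu-py) that returns IDs immediately without computing anything. It only shows the shape of the result dict and how an output ID flows into the next call.

```python
import uuid


def fake_run(module_path, args):
    """Stand-in for a submission call: returns IDs immediately, computes nothing."""
    return {
        "module_instance_id": str(uuid.uuid4()),
        "output_ids": [str(uuid.uuid4())],
    }


prep = fake_run("prepare_protein", ["protein.pdb"])
prepped_id = prep["output_ids"][0]

# The output ID is passed straight into the next step; nothing has been downloaded
sim = fake_run("simulate", [prepped_id, {"nsteps": 1000}])

# Both kinds of ID parse as UUIDs
print(uuid.UUID(prep["module_instance_id"]).version, uuid.UUID(sim["output_ids"][0]).version)
```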

> [!NOTE]  
> A coming improvement will provide explicit naming and type info for the
> inputs and outputs of each module, which will improve clarity and discoverability.

### 1.1) Prep the protein

In [None]:
pdb2pqr_result = client.run2(
    modules["prepare_protein_tengu"],
    [
        PROTEIN_PDB_PATH,
    ],
    target="NIX_SSH_3",
    resources={"gpus": 1, "storage": 1_024_000_000, "walltime": 15},
    tags=TAGS,
    restore=True,
)
pdb2pqr_run_id = pdb2pqr_result["module_instance_id"]
prepped_protein_id = pdb2pqr_result["output_ids"][0]
print(f"{datetime.now().time()} | Running protein prep!")

In [None]:
with open(OUT_DIR / f"01-pdb2pqr-{pdb2pqr_run_id}.json", "w") as f:
    json.dump(pdb2pqr_result, f, default=str, indent=2)
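
The `default=str` argument matters if the IDs in the result are objects rather than plain strings, because `json` cannot serialize arbitrary objects on its own. The sketch below uses `uuid.UUID` as a stand-in for the ID types (an assumption; the real types may differ) to show the difference.

```python
import json
import uuid

# uuid.UUID stands in for the ID types here; the values are arbitrary
result = {
    "module_instance_id": uuid.UUID("12345678-1234-5678-1234-567812345678"),
    "output_ids": [uuid.UUID("87654321-4321-8765-4321-876543218765")],
}

try:
    json.dumps(result)
except TypeError as err:
    print(f"without default=str: {err}")

# default=str is called for every object json cannot handle natively
text = json.dumps(result, default=str, indent=2)
print(text)
```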

In [None]:
client.poll_module_instance(pdb2pqr_run_id)
client.download_object(prepped_protein_id, OUT_DIR / "01-prepped-protein.pdb")
print(f"{datetime.now().time()} | Downloaded prepped protein!")

### 1.2) Prep the ligand

In [None]:
ligand_prep_config = {
    "source": "",
    "output_folder": "./",
    "job_manager": "multiprocessing",
    "num_processors": -1,
    "max_variants_per_compound": 1,
    "thoroughness": 3,
    "separate_output_files": True,
    "min_ph": 6.4,
    "max_ph": 8.4,
    "pka_precision": 1.0,
    "skip_optimize_geometry": True,
    "skip_alternate_ring_conformations": True,
    "skip_adding_hydrogen": False,
    "skip_making_tautomers": True,
    "skip_enumerate_chiral_mol": True,
    "skip_enumerate_double_bonds": True,
    "let_tautomers_change_chirality": False,
    "use_durrant_lab_filters": True,
}
ligand_prep_result = client.run2(
    modules["prepare_ligand_tengu"],
    [
        LIGAND_SMILES_STR,
        LIGAND_PDB_PATH,
        ligand_prep_config,
    ],
    target="NIX_SSH_3",
    resources={"gpus": 1, "storage": 16 * 1024, "walltime": 5},
    tags=TAGS,
    restore=True
)
ligand_prep_run_id = ligand_prep_result["module_instance_id"]
prepped_ligand_pdb_id = ligand_prep_result["output_ids"][0]
prepped_ligand_sdf_id = ligand_prep_result["output_ids"][1]
print(f"{datetime.now().time()} | Running ligand prep!")

In [None]:
with open(OUT_DIR / f"01-prepare-ligand-{ligand_prep_run_id}.json", "w") as f:
    json.dump(ligand_prep_result, f, default=str, indent=2)

In [None]:
print("Checking ligand prep instance", ligand_prep_run_id)
client.poll_module_instance(ligand_prep_run_id)
client.download_object(prepped_ligand_pdb_id, OUT_DIR / "01-prepped-ligand.pdb")
client.download_object(prepped_ligand_sdf_id, OUT_DIR / "01-prepped-ligand.sdf")

print(f"{datetime.now().time()} | Downloaded prepped ligand!")

### 2) Run GROMACS (module: gmx_tengu / gmx_tengu_pdb)

In [None]:
gmx_config = {
    "param_overrides": {
        "md": [("nsteps", "5000")],
        "em": [("nsteps", "1000")],
        "nvt": [("nsteps", "1000")],
        "npt": [("nsteps", "1000")],
        "ions": [],
    },
    "num_gpus": 0,
    "num_replicas": 1,
    "ligand_charge": None,
    "frame_sel": {
        "begin_time": 1,
        "end_time": 10,
        "delta_time": 2,
    },
}
gmx_result = client.run2(
    # TODO: Should be using qdxf conformer versions of these modules
    modules["gmx_tengu_pdb"],
    [
        prepped_protein_id,
        prepped_ligand_pdb_id,
        gmx_config,
    ],
    target="NIX_SSH_3",
    resources={"gpus": 0, "storage": 1, "storage_units": "GB", "cpus": 48, "walltime": 60},
    tags=TAGS,
    restore=True
)
gmx_run_id = gmx_result["module_instance_id"]
gmx_output_id = gmx_result["output_ids"][0]
gmx_ligand_gro_id = gmx_result["output_ids"][3]
print(f"{datetime.now().time()} | Running GROMACS simulation!")

In [None]:
with open(OUT_DIR / f"02-gmx-{gmx_run_id}.json", "w") as f:
    json.dump(gmx_result, f, default=str, indent=2)

In [None]:
print("Fetching gmx results", gmx_run_id)
client.poll_module_instance(gmx_run_id, n_retries=60, poll_rate=60)
client.download_object(gmx_output_id, OUT_DIR / "02-gmx-output.tar.gz")
print(f"{datetime.now().time()} | Downloaded GROMACS output!")
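
For long-running steps like this one, it helps to have a mental model of the polling parameters. The sketch below is a generic polling loop, assuming that `n_retries` caps the number of status checks and `poll_rate` is the delay between them in seconds. That reading is an assumption about `poll_module_instance`, and `poll_until_done` is a hypothetical helper. Under that reading, `n_retries=60, poll_rate=60` waits up to roughly an hour.

```python
import time


def poll_until_done(check, n_retries=10, poll_rate=30, sleep=time.sleep):
    """Call check() until it returns a non-None result or the retries run out."""
    for attempt in range(n_retries):
        result = check()
        if result is not None:
            return result
        if attempt < n_retries - 1:
            sleep(poll_rate)  # wait before the next status check
    raise TimeoutError(f"not finished after {n_retries} checks")


# Simulated job that completes on the third check; sleep is stubbed so this runs instantly
states = iter([None, None, "COMPLETED"])
waits = []
status = poll_until_done(lambda: next(states), n_retries=60, poll_rate=60, sleep=waits.append)
print(status, waits)
```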

In [None]:
# Extract the "dry" (i.e. non-solvated) frames we asked for
with tarfile.open(OUT_DIR / "02-gmx-output.tar.gz", "r") as tf:
    selected_frame_pdbs = [
        tf.extractfile(member).read()
        for member in sorted(tf, key=lambda m: m.name)
        if ("dry" in member.name and "pdb" in member.name)
    ]
    for i, frame in enumerate(selected_frame_pdbs):
        with open(OUT_DIR / f"02-gmx-output-frame{i}.pdb", "w") as pf:
            print(frame.decode("utf-8"), file=pf)
    # Base64-encode the ligand .gro file so it can be passed to the QP step as a value
    ligand_gro_bytes = tf.extractfile("outputs/ligand_in_GMX.gro").read()
    gmx_gro_base64_str = base64.b64encode(ligand_gro_bytes).decode("utf-8")
# client.download_object(gmx_ligand_gro_id, OUT_DIR / "02-gmx-ligand.gro")
pdb_base64_str = base64.b64encode(selected_frame_pdbs[0]).decode("utf-8")
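
If you want to try the extraction logic without a finished GROMACS run, the following self-contained sketch builds a small archive in memory and applies the same filter, sort, and Base64 steps. The member names and contents are made up for the example.

```python
import base64
import io
import tarfile


def add_member(tf, name, data):
    """Add an in-memory bytes payload to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tf.addfile(info, io.BytesIO(data))


# Build a small gzip-compressed archive in memory; the member names are made up
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    add_member(tf, "outputs/dry_frame_2.pdb", b"FRAME 2\n")
    add_member(tf, "outputs/dry_frame_1.pdb", b"FRAME 1\n")
    add_member(tf, "outputs/solvated.pdb", b"WET\n")
buf.seek(0)

# Keep only the "dry" PDB members, in name order
with tarfile.open(fileobj=buf, mode="r") as tf:
    dry_frames = [
        tf.extractfile(member).read()
        for member in sorted(tf, key=lambda m: m.name)
        if "dry" in member.name and member.name.endswith(".pdb")
    ]

# Base64 turns the raw bytes into text that can be embedded in a JSON argument
encoded = base64.b64encode(dry_frames[0]).decode("utf-8")
print(len(dry_frames), encoded)
```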

### 3.1) Run quantum energy calculation (modules: qp_gen_inputs, hermes_energy, qp_collate)

In [None]:
# We have a helper function for this, as it combines 3 modules without much need
# to inspect the intermediate results.
(_, _, qp_result) = client.run_qp(
    modules["qp_gen_inputs"],
    modules["hermes_energy"],
    modules["qp_collate"],
    pdb=tengu.Arg(value=pdb_base64_str),
    gro=tengu.Arg(value=gmx_gro_base64_str),
    lig=tengu.Arg(id=str(prepped_ligand_sdf_id)),
    lig_type=tengu.Arg(value="sdf"),
    lig_res_id=tengu.Arg(value="UNL"),  # The ligand's residue code in the PDB file; this is what our prep uses
    use_new_fragmentation_method=False,
    target="NIX_SSH_3",
    resources={"storage": 1_024_000_000, "walltime": 600},
    tags=TAGS,
    restore=True
)
qp_run_id = qp_result["module_instance_id"]
qp_interaction_energy_id = qp_result["output_ids"][0]
print(f"{datetime.now().time()} | Running QP energy calculation!")

In [None]:
with open(OUT_DIR / f"03-qp-{qp_run_id}.json", "w") as f:
    json.dump(qp_result, f, default=str, indent=2)

In [None]:
client.poll_module_instance(qp_run_id)
client.download_object(qp_interaction_energy_id, OUT_DIR / "03-qp-interaction-energy.json")
print(f"{datetime.now().time()} | Downloaded QP interaction energy!")

### 3.2) Run MM-PBSA

In [None]:
mmpbsa_config = {
    "start_frame": 1,
    "end_frame": 2,
    "num_cpus": 1,  # cannot be greater than number of frames
}
mmpbsa_result = client.run2(
    modules["gmx_mmpbsa_tengu"],
    [
        gmx_output_id,
        mmpbsa_config,
    ],
    target="GADI",
    resources={"storage": 100, "storage_units": "MB", "walltime": 600},
    tags=TAGS,
    restore=True
)
mmpbsa_run_id = mmpbsa_result["module_instance_id"]
mmpbsa_output_id = mmpbsa_result["output_ids"][0]
print(f"{datetime.now().time()} | Running GROMACS MM-PBSA calculation!")
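
The constraint noted in the config comment (`num_cpus` cannot exceed the number of frames) is easy to violate when you change the frame range. A minimal sketch of a guard follows. It assumes `start_frame` and `end_frame` form an inclusive range, which is an assumption about the module, and `clamp_num_cpus` is a hypothetical helper.

```python
def clamp_num_cpus(start_frame: int, end_frame: int, requested_cpus: int) -> int:
    """Limit the CPU count to the number of frames, assuming an inclusive frame range."""
    if end_frame < start_frame:
        raise ValueError("end_frame must not be smaller than start_frame")
    num_frames = end_frame - start_frame + 1
    return max(1, min(requested_cpus, num_frames))


# With frames 1..2 there are two frames, so a request for 8 CPUs is reduced
print(clamp_num_cpus(start_frame=1, end_frame=2, requested_cpus=8))
```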

In [None]:
with open(OUT_DIR / f"03-mmpbsa-{mmpbsa_run_id}.json", "w") as f:
    json.dump(mmpbsa_result, f, default=str, indent=2)

In [None]:
client.poll_module_instance(mmpbsa_run_id)
client.download_object(mmpbsa_output_id, OUT_DIR / "03-mmpbsa-output.zip")
print(f"{datetime.now().time()} | Downloaded MM-PBSA results!")