# GROMACS Resume - Start a long GROMACS run that times out, and resume it

On the supercomputers, a maximum runtime of 24 hours is enforced. If a longer GROMACS run is needed, it can be resumed. The first output of the Rush `gmx` module is designed to be used as the input to the `gmx_resume` module, which resumes the run from the latest checkpoint. The outputs of `gmx_resume` are identical to those of `gmx` itself, so the run can be resumed as many times as necessary to finish it.

## 0.0) Imports

In [None]:
import os
from pathlib import Path

import rush

In [None]:
# |hide
# Users won't generally create a workspace
# We nuke to ensure run is reproducible
import os

WORK_DIR = Path.home() / "qdx" / "tutorial-gmx-resume"

if WORK_DIR.exists():
    client = rush.Provider(workspace=WORK_DIR)
    await client.nuke(remote=False)

os.makedirs(WORK_DIR, exist_ok=True)

2024-03-21 14:35:57,255 - rush - INFO - Restoring by default via env


In [None]:
RUSH_TOKEN = os.getenv("RUSH_TOKEN") or "YOUR_TOKEN_HERE"
client = rush.build_blocking_provider_with_functions(access_token=RUSH_TOKEN)

2024-03-21 14:35:57,278 - rush - INFO - Restoring by default via env


In [None]:
# |hide
# We hide this because users will generally not set a workspace, and won't restore by default
client = rush.build_blocking_provider_with_functions(
    batch_tags=["tutorial-resume-gmx"],
    workspace=WORK_DIR,
)

2024-03-21 14:35:57,952 - rush - INFO - Restoring by default via env


## 0.1) Input Download and Selection

In [None]:
!pdb_fetch '1B39' | pdb_selchain -A | pdb_delhetatm > '1B39_A_nohet.pdb'

## 1) Input Preparation

In [None]:
_, prepared_protein_pdb = client.prepare_protein(
    Path.cwd() / "1B39_A_nohet.pdb"
)

2024-03-21 14:36:00,622 - rush - INFO - Trying to restore job with tags: ['tutorial-resume-gmx'] and path: github:talo/prepare_protein/947cdbc000031e192153a20a9b4a8fbb12279102#prepare_protein_tengu
2024-03-21 14:36:00,762 - rush - INFO - Restoring job from previous run with id 3e4e10dd-f7cf-4e33-b759-cc73d6ae11c9


## 2.1) Run GROMACS (modules: gmx, gmx_resume)
Next, we will run a molecular dynamics simulation on our protein using GROMACS via the `gmx` module.

We'll set `timeout_duration_mins = 1` so that the run times out before it finishes, and then resume via the `gmx_resume` module using the first output, which is the archive that contains all the necessary data for resuming the run from the last saved checkpoint.

We'll set `checkpoint_interval_mins = 1.0/60` so that the checkpointing takes place once per second.

We can restart as many times as we need to in order for the run to finish. Use a unique tag for each restarted run so that each sequential restart will be tagged and cached appropriately. See below for an example.

For each restarted run, pass the same config file that was passed to the original `gmx` call. Once a run has been started, it cannot be shortened or extended, and none of its other parameters can be changed, so the initial run config should specify the full desired run. Passing the same config also ensures that progress is reported properly and avoids other inconsistencies.

One current limitation is that the frame selection can only operate on the data generated by the last call to `gmx` or `gmx_resume`. To select frames from earlier parts of the run, the output xtc files from all the calls must be joined and processed manually.
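
Joining the per-call trajectories by hand could look roughly like the sketch below. It only builds a command line for `gmx trjcat`, the standard GROMACS tool for concatenating trajectory files. It assumes that the xtc files from each call have already been downloaded and unpacked locally and that a local GROMACS installation is available. The file names are hypothetical, and the helper `build_trjcat_command` is not part of Rush.

```python
import shlex
from pathlib import Path


def build_trjcat_command(parts, output):
    """Build a `gmx trjcat` command that concatenates trajectory parts in order."""
    if not parts:
        raise ValueError("need at least one trajectory file")
    return [
        "gmx",
        "trjcat",
        "-f",
        *[str(Path(p)) for p in parts],  # input trajectories, oldest first
        "-o",
        str(Path(output)),  # joined trajectory
    ]


# Hypothetical local file names, one per gmx / gmx_resume call
cmd = build_trjcat_command(["run_0.xtc", "run_1.xtc", "run_2.xtc"], "joined.xtc")
print(shlex.join(cmd))

# To actually run it (requires GROMACS on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

The joined file can then be processed with whatever frame selection tooling you prefer.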

In [None]:
gmx_config = {
    "params_overrides": {
        "nvt": {"nsteps": 2000},
        "npt": {"nsteps": 2000},
        "md": {"nsteps": 150000},
    },
    "frame_sel": {
        "start_time_ps": 290,
        "end_time_ps": 300,
        "delta_time_ps": 1,
    },
    "checkpoint_interval_mins": 1.0 / 60,
    "timeout_duration_mins": 1,
    "num_gpus": 1,
    "save_wets": False,
}
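
As a quick sanity check on this config, the arithmetic below works out what the values imply. The progress output later in this tutorial reports `n_expected=154000`, which matches the sum of the three `nsteps` values, so the sketch assumes that is how the expected step count is derived. The 2 fs timestep is also an assumption (it is not stated in this tutorial), chosen because it makes the MD stage end at the 300 ps used in `frame_sel`.

```python
# Same values as in gmx_config above, repeated so this cell is self-contained
params_overrides = {
    "nvt": {"nsteps": 2000},
    "npt": {"nsteps": 2000},
    "md": {"nsteps": 150000},
}
checkpoint_interval_mins = 1.0 / 60
timeout_duration_mins = 1

# Total steps across all stages (assumed to be what n_expected reports)
total_steps = sum(stage["nsteps"] for stage in params_overrides.values())
print("total steps:", total_steps)  # 154000

# Checkpoint interval in seconds, and roughly how many checkpoints fit in one job
checkpoint_interval_secs = checkpoint_interval_mins * 60
checkpoints_per_job = timeout_duration_mins / checkpoint_interval_mins
print("checkpoint every", round(checkpoint_interval_secs, 3), "s")
print("about", round(checkpoints_per_job), "checkpoints before the timeout")

# ASSUMPTION: 2 fs (0.002 ps) timestep; check the mdp parameters actually in use
ASSUMED_DT_PS = 0.002
md_length_ps = params_overrides["md"]["nsteps"] * ASSUMED_DT_PS
print("MD length under the assumed timestep:", round(md_length_ps, 3), "ps")
```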

In [None]:
resume_files_first, streaming_outputs, static_outputs, *rest = client.gmx(
    None,
    prepared_protein_pdb,
    None,
    gmx_config,
    resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
)

2024-03-21 14:36:00,773 - rush - INFO - Trying to restore job with tags: ['tutorial-resume-gmx'] and path: github:talo/tengu-gmx/04cff2931b995c33263dfdb477d7f09c8bbd75a7#gmx_tengu
2024-03-21 14:36:00,822 - rush - INFO - Restoring job from previous run with id f6d416cf-0cde-4bf1-8ccd-44c0e71aa0d9


In [None]:
# Check the progress and repeat this call as needed,
# incrementing the integer in the tag with each subsequent call
resume_files_second, _, _, xtc_dry, pdb_dry, _, _ = client.gmx_resume(
    resume_files_first,
    gmx_config,
    tags=["gmx-resume-1"],
    resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
)

2024-03-21 14:36:00,826 - rush - INFO - Trying to restore job with tags: ['gmx-resume-1', 'tutorial-resume-gmx'] and path: github:talo/tengu-gmx/04cff2931b995c33263dfdb477d7f09c8bbd75a7#gmx_resume_tengu
2024-03-21 14:36:01,321 - rush - INFO - Restoring job from previous run with id d38f766a-1ef3-4e0f-bd07-951ae32200e2


In [None]:
resume_files_third, _, _, xtc_dry, pdb_dry, _, _ = client.gmx_resume(
    resume_files_second,
    gmx_config,
    tags=["gmx-resume-2"],
    resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
)

2024-03-21 14:36:01,325 - rush - INFO - Trying to restore job with tags: ['gmx-resume-2', 'tutorial-resume-gmx'] and path: github:talo/tengu-gmx/04cff2931b995c33263dfdb477d7f09c8bbd75a7#gmx_resume_tengu
2024-03-21 14:36:01,388 - rush - INFO - Restoring job from previous run with id 4dca3d01-abaa-40db-ae6b-6d88f4123f8a


## Checking progress of GROMACS run
To determine if your GROMACS run is done, or if further runs are required, you can look at the progress output.
How to check this output is demonstrated below. 


This is an example progress event that indicates that the job has not completed, and will require resuming.
`n` is the number of execution steps completed so far by the GMX module. If `n` is less than `n_expected`, or `done` is false, the run is not yet complete.


`gmx_resume_tengu progress: {
  "n": 600,
  "n_expected": 601,
  "n_max": 601,
  "done": false
}`
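
The rule above can be written as a small check. This is a minimal sketch that parses the example event as JSON. The helper name `run_is_complete` is made up for illustration and is not part of Rush.

```python
import json


def run_is_complete(progress):
    """Return True only if the event says done and n has reached n_expected."""
    return bool(progress["done"]) and progress["n"] >= progress["n_expected"]


# The example progress event shown above
event = json.loads('{"n": 600, "n_expected": 601, "n_max": 601, "done": false}')
print(run_is_complete(event))  # False: the run needs to be resumed
```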

In [None]:
# Fetch the module instance behind each job so that we can inspect its progress
result1 = client.module_instance_blocking(resume_files_first.source)
result2 = client.module_instance_blocking(resume_files_second.source)
result3 = client.module_instance_blocking(resume_files_third.source)

In [None]:
print(result1.progress)
print(result2.progress)
print(result3.progress)

n=139100 n_expected=154000 n_max=154000 done=False
n=154000 n_expected=154000 n_max=154000 done=True
n=154000 n_expected=154000 n_max=154000 done=True


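At around 50k steps per job, three jobs are not guaranteed to cover the 154,000 expected steps, so we launch another two resume jobs. Throughput varies between executions (in the output above, the run had already finished by the second job), so further resumes are only needed while `done` is `False`.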
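
Rather than launching a fixed number of resume jobs by hand, the check-and-resume cycle can be wrapped in a loop. The sketch below is one possible way to do that, not an official Rush helper. It assumes that `client.module_instance_blocking(...).progress` exposes a `done` attribute, as the printed progress above suggests, and that the first output of `gmx_resume` is always the resume archive. The function name `resume_until_done` and the `max_resumes` cap are hypothetical.

```python
def resume_until_done(client, resume_files, config, resources, max_resumes=10):
    """Resume a GROMACS run until its progress reports done.

    Returns the last resume archive and the number of resume jobs launched.
    """
    resumes = 0
    while True:
        # Wait for the job that produced this archive and read its progress
        progress = client.module_instance_blocking(resume_files.source).progress
        if progress.done:
            return resume_files, resumes
        if resumes >= max_resumes:
            raise RuntimeError(f"run not finished after {max_resumes} resumes")
        resumes += 1
        # A unique tag per resume keeps each job cached separately
        outputs = client.gmx_resume(
            resume_files,
            config,
            tags=[f"gmx-resume-{resumes}"],
            resources=resources,
        )
        resume_files = outputs[0]


# Example call, using the names from this tutorial:
# last_files, n_resumes = resume_until_done(
#     client,
#     resume_files_first,
#     gmx_config,
#     resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
# )
```

The cap on the number of resumes is there so that a run that never reports `done` does not keep submitting jobs indefinitely.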

In [None]:
resume_files_fourth, _, _, xtc_dry, pdb_dry, _, _ = client.gmx_resume(
    resume_files_third,
    gmx_config,
    tags=["gmx-resume-3"],
    resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
)

2024-03-21 14:36:02,046 - rush - INFO - Trying to restore job with tags: ['gmx-resume-3', 'tutorial-resume-gmx'] and path: github:talo/tengu-gmx/04cff2931b995c33263dfdb477d7f09c8bbd75a7#gmx_resume_tengu
2024-03-21 14:36:02,094 - rush - INFO - Restoring job from previous run with id 517c9492-4442-4940-b1ca-f95ce022b27d


In [None]:
resume_files_fifth, _, _, xtc_dry, pdb_dry, _, _ = client.gmx_resume(
    resume_files_fourth,
    gmx_config,
    tags=["gmx-resume-4"],
    resources={"gpus": 1, "storage": 1, "storage_units": "GB"},
)

2024-03-21 14:36:02,098 - rush - INFO - Trying to restore job with tags: ['gmx-resume-4', 'tutorial-resume-gmx'] and path: github:talo/tengu-gmx/04cff2931b995c33263dfdb477d7f09c8bbd75a7#gmx_resume_tengu
2024-03-21 14:36:02,151 - rush - INFO - Restoring job from previous run with id f37e3fe9-5930-473c-91bb-fdcafb98715d


## Downloading Results
To download the extracted frames and fetch their PDBs, we can do the following:

In [None]:
pdb_dry.download("dry_frames.tar.gz")

Exception: (<ModuleFailureReason.RESOLUTION: 'RESOLUTION'>, ModuleInstanceCommonFailureContext(stdout=None, stderr=None, syserr='argument resolution failed due to rejected arguments'))

In [None]:
import tarfile

objects_dir = client.workspace / "objects"

# Read every PDB file in the downloaded archive into memory
with tarfile.open(objects_dir / "dry_frames.tar.gz", "r") as tf:
    selected_frame_pdbs = [
        tf.extractfile(member).read()
        for member in tf
        if member.isfile() and "pdb" in member.name
    ]

# Write each selected frame to its own PDB file
for i, frame in enumerate(selected_frame_pdbs):
    with open(objects_dir / f"gmx_output_frame_{i}.pdb", "w") as pf:
        print(frame.decode("utf-8"), file=pf)

In [None]:
with open(client.workspace / "objects" / "gmx_output_frame_0.pdb", "r") as f:
    # Show the first 10 lines of the first extracted frame
    print("".join(f.readlines()[:10]))