# Running a Multi-Step Workflow with a Remote System

## Aim

This notebook demonstrates how to connect and execute different types of tasks within a single workflow, including tasks that run on an external computer and tasks that require dynamic inputs. As an example, we begin with an initial structure, perform `geometry optimization`, compute `descriptors`, and run a script to split the structures into training, test and validation files. We then run `Quantum ESPRESSO` on an external system (SCARF) to calculate energies, forces and stresses, and finally use the output files to train a model. The goal is to illustrate how to design and execute workflows that integrate local and remote tasks.

### Setup

For this tutorial we will assume you have: 
<ul>
        <li>An AiiDA profile set up</li>
        <li>An external computer set up in AiiDA with a Quantum ESPRESSO code</li>
                <ul>
                        <li>A tutorial can be found in <code>../aiida_setup/setup-external-computer.ipynb</code></li>
                </ul>
        <li>The <code>aiida-quantumespresso</code>, <code>aiida-pseudo</code> and <code>fpsample</code> extra dependencies installed</li>
        <li>The SSSP pseudopotentials installed</li>
                <ul>
                        <li>They can be installed with: <code>aiida-pseudo install sssp</code></li>
                </ul>

</ul>

The initial setup is very similar to the other tutorials, such as `singlepoint.ipynb`, which explains each step in more detail.

Load the AiiDA profile, the model and the code:

In [None]:
from aiida import load_profile
load_profile()

In [None]:
from aiida_mlip.data.model import ModelData
uri = "https://github.com/stfc/janus-core/raw/main/tests/models/mace_mp_small.model"
model = ModelData.from_uri(uri, architecture="mace_mp")

This time we also load the code which executes the process on the external computer, `qe@scarf`.

In [None]:
from aiida.orm import load_code

janus_code = load_code("janus@localhost")
qe_code = load_code("qe@scarf")

We must now choose the calculations to perform:

In [None]:
from aiida.plugins import CalculationFactory
geomoptCalc = CalculationFactory("mlip.opt")
descriptorsCalc = CalculationFactory("mlip.descriptors")
trainCalc = CalculationFactory("mlip.train")

Before setting up the work graph, we configure the `Quantum ESPRESSO (QE)` task by defining its code and input parameters. Because QE must run on many structures, the `qe` graph task below uses `get_current_graph()` to add one `PwCalculation` per structure dynamically. It returns the `TrajectoryData` and output parameters of each calculation, grouped by the file the structure came from.

In [None]:
from aiida_workgraph import task
from aiida_workgraph.manager import get_current_graph
from aiida.orm import StructureData, load_group, KpointsData, SinglefileData
from ase.io import iread
from pathlib import Path
import yaml
from aiida_quantumespresso.calculations.pw import PwCalculation
from sample_split import process_and_split_data


@task.graph(outputs = ["test_file", "train_file", "valid_file"])
def qe(**inputs):

    # The graph that this task is being built into.
    wg = get_current_graph()

    task_inputs = inputs["task_params"]["task_inputs"]
    code = inputs["task_params"]["code"]

    kpoints = KpointsData()
    kpoints.set_kpoints_mesh(task_inputs["kpoint_mesh"])

    pseudo_family = load_group("SSSP/1.3/PBE/efficiency")
    files = {
        "test_file": inputs["test_file"],
        "train_file": inputs["train_file"],
        "valid_file": inputs["valid_file"],
    }

    for file_name, file in files.items():
        with file.as_path() as path:
            for i, structs in enumerate(iread(path, format="extxyz")):
                
                structure = StructureData(ase=structs)
                pseudos = pseudo_family.get_pseudos(structure=structure)

                ecutwfc, ecutrho = pseudo_family.get_recommended_cutoffs(
                    structure=structure,
                    unit='Ry',
                )

                pw_params = {
                    "CONTROL": {
                        "calculation": "scf",
                        'tprnfor': True,
                        'tstress': True,
                    },
                    "SYSTEM": {
                        "ecutwfc": ecutwfc,
                        "ecutrho": ecutrho,
                    },
                }
                
                # One PwCalculation per structure, added to the enclosing graph.
                qe_task = wg.add_task(
                    PwCalculation,
                    code=code,
                    parameters=pw_params,
                    kpoints=kpoints,
                    pseudos=pseudos,
                    metadata=task_inputs["metadata"],
                    structure=structure,
                )

                # Store the outputs under "<file_name>.struct<i>" in the graph context.
                structfile = f"{file_name}.struct{i}"

                wg.update_ctx({
                    structfile: {
                        "trajectory": qe_task.outputs.output_trajectory,
                        "parameters": qe_task.outputs.output_parameters,
                    }
                })

    return {
        "test_file": wg.ctx.test_file,
        "train_file": wg.ctx.train_file,
        "valid_file": wg.ctx.valid_file
    }    
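
The context keys used above have the form `<file_name>.struct<i>`, and the `return` statement then reads `wg.ctx.test_file`, `wg.ctx.train_file` and `wg.ctx.valid_file`. The sketch below uses plain dictionaries, not the aiida-workgraph API, to show the grouping this implies. It assumes dotted keys are nested by file name; `set_nested` and the placeholder strings are hypothetical names introduced only for illustration.

```python
def set_nested(ctx, dotted_key, value):
    """Store value under a dotted key, creating intermediate dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = ctx
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Stand-in for the graph context; strings stand in for the QE task outputs.
ctx = {}
for file_name, n_structs in {"train_file": 2, "test_file": 1}.items():
    for i in range(n_structs):
        set_nested(
            ctx,
            f"{file_name}.struct{i}",
            {
                "trajectory": f"traj-{file_name}-{i}",
                "parameters": f"params-{file_name}-{i}",
            },
        )

# Every structure from the same file ends up under the same top-level key.
print(sorted(ctx["train_file"]))
```

Under this assumption, each top-level entry is a mapping from structure label to its `trajectory` and `parameters`, which is the shape `create_train_file` iterates over.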

The next task extracts the required results from the QE tasks and writes the files used for training. It creates one `mlip_<name>.extxyz` file per split (for example `mlip_train_file.extxyz`) and returns a `JanusConfigfile`, which is used for the training calculation.

In [None]:
from aiida_mlip.data.config import JanusConfigfile
from aiida.orm import Dict
from ase.io import write
from ase import units

@task.calcfunction(outputs = ["JanusConfigfile"])
def create_train_file(**inputs):

    training_files = {}
    
    for file_name, structs in inputs.items():
        path = Path(f"mlip_{file_name}.extxyz")

        for struct_out_params in structs.values():

            trajectory = struct_out_params["trajectory"]

            # Rebuild the ASE atoms object from the first trajectory frame.
            fileStructure = trajectory.get_structure(index=0)
            fileAtoms = fileStructure.get_ase()

            # Convert the stress to eV/Ang^3 (assumes QE reports it in GPa).
            stress = trajectory.arrays["stress"][0]
            converted_stress = stress * units.GPa
            fileAtoms.info["qe_stress"] = converted_stress

            fileAtoms.info["units"] = {"energy": "eV", "forces": "eV/Ang", "stress": "eV/Ang^3"}
            fileAtoms.set_array("qe_forces", trajectory.arrays["forces"][0])

            parameters = struct_out_params["parameters"]
            fileParams = parameters.get_dict()
            fileAtoms.info["qe_energy"] = fileParams["energy"]
            # Append so that every structure of this split ends up in one file.
            write(path, fileAtoms, append=True)

        training_files[file_name] = str(path.resolve())

    # Write mode ("w") avoids duplicated keys if the notebook is run more than once.
    with open("JanusConfigfile.yml", "w") as f:
        yaml.safe_dump(training_files, f, sort_keys=False)

    return {"JanusConfigfile": JanusConfigfile(Path("JanusConfigfile.yml").resolve())}
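
The stress conversion in `create_train_file` multiplies by `ase.units.GPa`. As a standalone sketch of what that factor represents, the snippet below derives the GPa to eV/Ang^3 conversion from SI definitions. It assumes the exact 2019 SI value of the elementary charge; the constant shipped with ASE may differ in the trailing decimals depending on the CODATA version it uses. The function name is hypothetical.

```python
# Exact SI (2019) value of the elementary charge, i.e. joules per electronvolt.
JOULE_PER_EV = 1.602176634e-19
CUBIC_METRE_PER_CUBIC_ANGSTROM = 1e-30


def gpa_to_ev_per_ang3(stress_gpa):
    """Convert a stress from GPa to eV/Ang^3."""
    pascal = stress_gpa * 1e9  # 1 Pa = 1 J/m^3
    return pascal * CUBIC_METRE_PER_CUBIC_ANGSTROM / JOULE_PER_EV


one_gpa = gpa_to_ev_per_ang3(1.0)
print(f"1 GPa = {one_gpa:.6f} eV/Ang^3")
```

So 1 GPa corresponds to roughly 0.00624 eV/Ang^3, which is why the stored stresses are numerically much smaller than the values QE reports.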

This task wraps a pure Python function, `process_and_split_data`, to show that ordinary Python functions can also run as part of the work graph. It returns `SinglefileData` instances of the test, train and valid files.

In [None]:
@task.calcfunction(outputs = ["test_file", "train_file", "valid_file"])
def create_aiida_files(**inputs):
     
    files = process_and_split_data(**inputs)

    return {
        "train_file": SinglefileData(files["train_file"]),
        "test_file": SinglefileData(files["test_file"]),
        "valid_file": SinglefileData(files["valid_file"])
    }
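
The implementation of `process_and_split_data` is not shown in this notebook. Since `fpsample` is listed as a dependency, one plausible ingredient is farthest point sampling, which picks a diverse subset of structures based on their descriptors. The following is a minimal pure-Python sketch of that general technique under this assumption; it is not the code in `sample_split`, and the function name and toy descriptor values are hypothetical.

```python
import math


def farthest_point_sample(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from everything selected so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


# Toy one-dimensional descriptors for five structures.
descriptors = [(0.0,), (0.1,), (5.0,), (5.1,), (10.0,)]
picked = farthest_point_sample(descriptors, 3)
print(picked)
```

The selected indices could then form one split, with the remaining structures divided among the others; how the real script assigns them is not specified here.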

We initialize the inputs we want for all the calculations. These variables can be changed depending on the configuration you are running and whether you want to change any inputs.

In [None]:
from aiida.orm import Str, Float, Bool, Int, List

calc_inputs = {
    "code": janus_code,
    "model": model,
    "arch": Str(model.architecture),
    "device": Str("cpu"),
    "metadata": {"options": {"resources": {"num_machines": 1}}},
}

goemopt_inputs = {
    "fmax": Float(0.1),
    "opt_cell_lengths": Bool(False),
    "opt_cell_fully": Bool(True),
}

split_task_inputs = {
    "config_types": Str(""),
    "prefix": Str(""),
    "scale": Float(1.0e5),
    "append_mode": Bool(False),
}

qe_inputs = {
    "task_inputs": Dict({
        "metadata": {
            "options": {
                "resources": {
                    "num_machines": 1,
                    "num_mpiprocs_per_machine": 32,
                },
                "max_wallclock_seconds": 3600,
                "queue_name": "scarf",
                "qos": "scarf",
                "environment_variables": {},
                "withmpi": True,
                "prepend_text": """
                    module purge
                    module use /work4/scd/scarf562/eb-common/modules/all
                    module load amd-modules
                    module load QuantumESPRESSO/7.2-foss-2023a
                """,
                "append_text": "",
            },
        },
        "kpoint_mesh": List([1, 1, 1]),
    }),
    "code": qe_code,
}
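
Because `prepend_text` is written as an indented triple-quoted string, every line keeps the leading whitespace of the notebook cell. Shells generally tolerate this, but if a cleaner submission script is preferred, a small sketch using the standard library's `textwrap.dedent` is shown below. The module names are shortened from the cell above purely for illustration.

```python
import textwrap

raw_prepend = """
    module purge
    module load QuantumESPRESSO/7.2-foss-2023a
"""

# Remove the common indentation, then the blank first and last lines.
prepend_text = textwrap.dedent(raw_prepend).strip()
print(prepend_text)
```

The cleaned string could be passed as `"prepend_text"` in place of the literal.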

Now we can build the `WorkGraph`. First, we iterate over each structure in the initial structure file and run `Geomopt` and `Descriptors` on it; each pair yields a `SinglefileData` instance containing the structure outputs. These structures are then passed to `split_task`, which splits them into training, test and validation files. Next, the `QE` task runs, and its outputs are passed to the `training_files` task, which creates the training files from them. Finally, we run the training script. If any of the inputs need to be changed, it is best to change them in the cell above.

In [None]:
from aiida_workgraph import WorkGraph
from aiida.orm import StructureData
from ase.io import iread

initial_structure = "../structures/NaCl-traj.xyz"

with WorkGraph("QE Calculation Workgraph") as wg:

    final_structures = {}

    for i, struct in enumerate(iread(initial_structure)):
        structure = StructureData(ase=struct)
        
        geomopt_calc = wg.add_task(
            geomoptCalc,
            **calc_inputs,
            **goemopt_inputs,
            struct=structure,
        )
        
        descriptors_calc = wg.add_task(
            descriptorsCalc,
            **calc_inputs,
            struct=geomopt_calc.outputs.final_structure,
            calc_per_element=True,
        )

        final_structures[f"structs{i}"] = descriptors_calc.outputs.xyz_output

    split_task = wg.add_task(
        create_aiida_files, 
        **split_task_inputs,
        trajectory_data=final_structures,
        n_samples= Int(len(final_structures)),
        )
    
    qe_task = wg.add_task(
        qe, 
        name="QE_workflow",
        test_file= split_task.outputs.test_file,
        train_file= split_task.outputs.train_file,
        valid_file= split_task.outputs.valid_file,
        task_params = qe_inputs
    )

    training_files = wg.add_task(
        create_train_file, 
        test_file= qe_task.outputs.test_file,
        train_file= qe_task.outputs.train_file,
        valid_file= qe_task.outputs.valid_file,
        )

    train_task = wg.add_task(
        trainCalc,
        mlip_config=training_files.outputs.JanusConfigfile,
        code=calc_inputs["code"],
        foundation_model=calc_inputs["model"],
        metadata=calc_inputs["metadata"],
        fine_tune=True,
    )
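
The links between task outputs and inputs in the cell above define the execution order. As a rough illustration that does not use AiiDA, the same dependency structure can be written out and ordered with the standard library's `graphlib`. The task labels are hypothetical stand-ins, apart from `QE_workflow`, and two input structures are assumed.

```python
from graphlib import TopologicalSorter

n_structures = 2  # assumed number of structures in the input file

# Map each task label to the set of tasks it depends on.
deps = {}
for i in range(n_structures):
    deps[f"geomopt{i}"] = set()
    deps[f"descriptors{i}"] = {f"geomopt{i}"}
deps["create_aiida_files"] = {f"descriptors{i}" for i in range(n_structures)}
deps["QE_workflow"] = {"create_aiida_files"}
deps["create_train_file"] = {"QE_workflow"}
deps["train"] = {"create_train_file"}

# One valid execution order; independent per-structure tasks may interleave.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

The per-structure optimization and descriptor tasks are independent of each other, while everything from the split onwards forms a single chain.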

Visualise and run the work graph:

In [None]:
wg

In [None]:
wg.run()

If we want to get the training plot, we have to pull it from the remote folder.  

In [None]:
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

folder = wg.tasks.Train.outputs.remote_folder.value
picturePath = f"{os.getcwd()}/traingraph.png"
folder.getfile(relpath='results/test_run-123_train_Default_stage_one.png',destpath=picturePath)

img = mpimg.imread(picturePath)
plt.imshow(img)