# WorkGraph example to run Geometry Optimisation and Descriptors

## Aim

This notebook demonstrates how different types of tasks can be connected within a workflow. As an example, we start from a structure, optimize its geometry, compute descriptors, and then use a filtering function to split the resulting structures into `train.xyz`, `test.xyz`, and `valid.xyz`. The goal is to show how workflows can seamlessly combine CalcJobs (as in aiida-mlip) with a calcfunction, demonstrating the flexibility of chaining tasks together.

### Setup

The initial setup is very similar to the other tutorials, such as `singlepoint.ipynb`, which goes into more detail about what each step is doing.

We will need the `fpsample` dependency to run `sample_split.py`. It can be installed as an optional dependency defined in `pyproject.toml`.
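
Before going further, it can help to confirm that the optional dependency is importable in the current environment. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of aiida-mlip.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return find_spec(module_name) is not None


# Warn early rather than failing later inside the workflow.
if not is_installed("fpsample"):
    print("fpsample is not installed; install it before running sample_split.py")
```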


Load the aiida profile and code:

In [None]:
from aiida import load_profile
load_profile()

In [None]:
from aiida_mlip.data.model import ModelData
uri = "https://github.com/stfc/janus-core/raw/main/tests/models/mace_mp_small.model"
model = ModelData.from_uri(uri, architecture="mace_mp", cache_dir="mlips")

In [None]:
from aiida.orm import load_code
janus_code = load_code("janus@localhost")

Inputs should include the model, code, metadata, and any other keyword arguments expected by the calculation we are running:

In [None]:
from aiida.orm import Str, Float, Bool
inputs = {
    "code": janus_code,
    "model": model,
    "arch": Str(model.architecture),
    "device": Str("cpu"),
    "metadata": {"options": {"resources": {"num_machines": 1}}},
}
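
Because the same `inputs` dictionary is reused by every task, any change to `metadata` should be made on a copy. The sketch below uses a plain dictionary and a hypothetical helper, `with_options`, to add a scheduler option without touching the shared defaults. `max_wallclock_seconds` is a standard AiiDA `CalcJob` option, but the value used here is only an illustration.

```python
from copy import deepcopy

# Same shape as the "metadata" entry in the inputs above.
base_metadata = {"options": {"resources": {"num_machines": 1}}}


def with_options(metadata: dict, **options) -> dict:
    """Return a deep copy of `metadata` with extra entries merged into "options"."""
    updated = deepcopy(metadata)
    updated["options"].update(options)
    return updated


# A variant for a longer job; base_metadata is left unchanged.
long_job_metadata = with_options(base_metadata, max_wallclock_seconds=3600)
```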

We now load the calculations we want to run:

In [None]:
from aiida.plugins import CalculationFactory

geomoptCalc = CalculationFactory("mlip.opt")
descriptorsCalc = CalculationFactory("mlip.descriptors")

Now we can create our WorkGraph. We pass in the inputs, loop over the structures, and add the calculations for each one.

For each input structure:
1. Run a geometry optimization.
   This returns `final_structure`, a StructureData node
   containing the optimized atomic positions and cell.
2. Pass the optimized `final_structure` into a descriptors calculation.
   The descriptors job reads the structure and computes numerical features
   (fingerprints) for it.
3. Collect the `xyz_output` of each descriptors calculation
   and pass them all to `process_and_split_data` (a calcfunction).
4. `process_and_split_data` writes the structures to `train.xyz`, `test.xyz`,
   and `valid.xyz` files, and returns a Dict node with the file paths.
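
The shape of the steps above (one chain per structure, followed by a single task that receives all results) can be sketched in plain Python. All four functions below are hypothetical stand-ins that only manipulate strings; they are not the aiida-mlip calculations.

```python
def optimise(structure):
    # Hypothetical stand-in for the geometry optimisation task.
    return f"optimised({structure})"


def describe(structure):
    # Hypothetical stand-in for the descriptors task.
    return f"described({structure})"


def split(collected):
    # Hypothetical stand-in for the final task: it only sees the collected outputs.
    return sorted(collected.values())


def run_chain(structures):
    collected = {}
    for i, structure in enumerate(structures):
        optimised = optimise(structure)                 # step 1
        collected[f"structs{i}"] = describe(optimised)  # steps 2 and 3
    return split(collected)                             # step 4


print(run_chain(["frame0", "frame1"]))
```

In the WorkGraph itself, the dictionary values are task outputs that are linked when the graph is built and resolved when it runs, rather than values computed immediately as in this sketch.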

In [None]:
from aiida.orm import Bool, Float, Int, Str, StructureData
from aiida_workgraph import WorkGraph
from ase.io import read
from sample_split import process_and_split_data

# Read every frame of the trajectory once; each frame gets its own chain of tasks.
initial_structure = "../structures/lj-traj.xyz"
frames = read(initial_structure, index=":")
num_structs = len(frames)

with WorkGraph("Calculation Workgraph") as wg:
    final_structures = {}

    for i, frame in enumerate(frames):
        structure = StructureData(ase=frame)

        geomopt_calc = wg.add_task(
            geomoptCalc,
            code=inputs['code'],
            model=inputs['model'],
            arch=inputs['arch'],
            device=inputs['device'],
            metadata=inputs['metadata'],
            fmax=Float(0.1),
            opt_cell_lengths=Bool(False),
            opt_cell_fully=Bool(True),
            struct=structure,
        )

        descriptors_calc = wg.add_task(
            descriptorsCalc,
            code=inputs['code'],
            model=inputs['model'],
            arch=inputs['arch'],
            device=inputs['device'],
            metadata=inputs['metadata'],
            struct=geomopt_calc.outputs.final_structure,
            calc_per_element=True,
        )

        final_structures[f"structs{i}"] = descriptors_calc.outputs.xyz_output

    split_task = wg.add_task(
        process_and_split_data,
        config_types=Str(""),
        n_samples=Int(num_structs),
        prefix=Str(""),
        scale=Float(1.0e5),
        append_mode=Bool(False),
        trajectory_data=final_structures,
    )


Visualise the WorkGraph:

In [None]:
wg


Run the tasks:

In [None]:
wg.run()

We should get a dictionary with filepaths:

In [None]:
wg.tasks.process_and_split_data.outputs.result.value.get_dict()
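
Before reading the files, it may be worth checking that every expected path is present and points to an existing file. This is a minimal sketch: the key names are the ones used later in this notebook, and the helper `missing_split_files` is hypothetical.

```python
from pathlib import Path

# Keys assumed from the cells that read the result dictionary later on.
EXPECTED_KEYS = ("train_file", "valid_file", "test_file")


def missing_split_files(result: dict) -> list:
    """Return the expected keys that are absent or whose path is not an existing file."""
    return [
        key
        for key in EXPECTED_KEYS
        if key not in result or not Path(result[key]).is_file()
    ]
```

An empty list means all three files were found; otherwise the returned keys show which splits need attention.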

We can use the outputs to visualise the data. For example, below we plot a histogram of `mace_mp_descriptor` for each of the three splits:

In [None]:
split_files = wg.tasks.process_and_split_data.outputs.result.value.get_dict()
test_file = split_files["test_file"]
train_file = split_files["train_file"]
valid_file = split_files["valid_file"]

In [None]:
import numpy as np
from ase.io import iread
import matplotlib.pyplot as plt

# Collect the descriptor stored in each frame's `info` dictionary.
test_mace_desc = np.array([atoms.info["mace_mp_descriptor"] for atoms in iread(test_file, index=":")])
train_mace_desc = np.array([atoms.info["mace_mp_descriptor"] for atoms in iread(train_file, index=":")])
valid_mace_desc = np.array([atoms.info["mace_mp_descriptor"] for atoms in iread(valid_file, index=":")])

# Use the same bin edges for all three splits so the histograms are directly comparable.
all_values = np.concatenate([train_mace_desc, valid_mace_desc, test_mace_desc])
bins = np.linspace(all_values.min(), all_values.max(), len(all_values))

fig, ax = plt.subplots()

ax.hist([train_mace_desc, valid_mace_desc, test_mace_desc],
        bins=bins,
        label=["Train", "Valid", "Test"],
        color=["blue", "green", "red"],
        edgecolor="black",
        rwidth=0.9,
        histtype="bar")

ax.legend()
plt.show()