# Workflow

## Assumed directory structure

```
example_directory
├── active_learning
│   ├── xyz
│   └── simulation.lammps
├── cp2k_input
│   └── template.inp
├── cp2k_output
├── lammps
│   └── template.lmp
├── n2p2
│   └── input.nn.template
├── qe
│   ├── pseudos
│   │   └── ...
│   └── mcresol-T300-p1.xyz
├── scripts
│   ├── cp2k.ipynb
│   ├── data_pruning.ipynb
│   ├── quantum_espresso.ipynb
│   ├── workflow.ipynb
│   └── visualise.ipynb
├── validation
├── xyz
└── reference.history
```

Although the functions accept explicit file paths, their default arguments assume the directory structure above and read from and write to locations within it.

File names are also handled differently depending on whether one file or many are written. Because only a single trajectory is expected, it is given as a full file name (e.g. `'example_trajectory.history'`). File names for the individual frames, by contrast, must contain a pair of braces that is filled in when each file is created (e.g. `'xyz/{}.xyz'`).
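
As a minimal sketch of the brace convention, the snippet below uses plain Python string formatting. The frame indices and the `sub_directory` variable are illustrative only; how the library fills the braces internally is not shown here.

```python
# Pattern for per-frame files: the empty braces are filled with the frame index.
frame_pattern = 'xyz/{}.xyz'

# Illustrative indices; in the workflow these come from the trajectory frames.
frame_files = [frame_pattern.format(i) for i in range(3)]

# Inside an f-string, doubled braces produce a literal '{}' that can be
# formatted later, which is why patterns such as f"{...}/xyz/{{}}.xyz" appear.
sub_directory = 'active_learning'
nested_pattern = f"{sub_directory}/xyz/{{}}.xyz"

print(frame_files)
print(nested_pattern)
```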

Finally, some steps use "template" files. These hold settings that do not need to change routinely as part of the workflow and do not depend on the structures being handled. To change these settings, edit the template files directly.

The majority of file management and high level commands are called via the `Controller` object. This stores information about the directory structure, location of executables and the properties of the atoms in question. The latter in turn uses `Species` and `Structure` objects to store general information about the systems of interest, with specific configurations of atoms being represented by a `Dataset` and its constituent `Frames`.

In [None]:
# Executables, filepaths, etc.
main_directory = '..'
n2p2_sub_directories = ['n2p2']
lammps_sub_directory = 'lammps'
n2p2_bin = '/path/to/n2p2/bin'
lammps_executable = '/path/to/lammps/build/lmp_mpi'
n2p2_module_commands = [
    'module use ...',
    'module load ...',
]
slurm_constraint = "constraint"

In [None]:
from cc_hdnnp.controller import Controller
from cc_hdnnp.structure import AllStructures, Species, Structure

# Create objects for all elements in the structure
H = Species(
    symbol='H',
    atomic_number=1,
    mass=1.00794,
    valence=1,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
)
C = Species(
    symbol='C',
    atomic_number=6,
    mass=12.011,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
    valence=4,
)
O = Species(
    symbol='O',
    atomic_number=8,
    mass=15.9994,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
    valence=6
)

# Define a name for the Structure which has the above constituent elements
# Information used for active learning, such as the energy and force tolerances, is also defined here
all_species = [H, C, O]
structure = Structure(name='mcresol', all_species=all_species, delta_E=1e-4, delta_F=1e-2)
all_structures = AllStructures(structure)

controller = Controller(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_sub_directories=n2p2_sub_directories,
    lammps_sub_directory=lammps_sub_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable,
    n2p2_module_commands=n2p2_module_commands,
)

## 1. Generate dataset
Either [Quantum Espresso](quantum_espresso.ipynb) or [CP2K](cp2k.ipynb) can be used to generate energy, force and charge values for an input trajectory. See the individual notebooks for details.

## 2. N2P2
Once force and energy values are obtained, and written to the N2P2 data format, the rest of N2P2 can be set up prior to training.

### Symmetry Functions
Several symmetry functions can be written to the same network input file, for example both shifted and centered versions of the radial, angular narrow and angular wide functions:

In [None]:
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='radial',
    rule='imbalzano2018',
    mode='center',
    n_pairs=5
)
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='angular_narrow',
    rule='imbalzano2018',
    mode='center',
    n_pairs=5,
    zetas=[1]
)
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='angular_wide',
    rule='imbalzano2018',
    mode='center',
    n_pairs=5,
    zetas=[1]
)
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='radial',
    rule='imbalzano2018',
    mode='shift',
    n_pairs=5
)
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='angular_narrow',
    rule='imbalzano2018',
    mode='shift',
    n_pairs=5,
    zetas=[1]
)
controller.write_n2p2_nn(
    file_nn_template='input.nn.template',
    file_nn='input.nn',
    r_cutoff=12.0,
    type='angular_wide',
    rule='imbalzano2018',
    mode='shift',
    n_pairs=5,
    zetas=[1]
)
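
The six calls above differ only in `type`, `mode` and the presence of `zetas`, so the argument sets can also be generated in a loop. The sketch below only builds the keyword dictionaries; the names `common` and `calls` are illustrative and not part of the library.

```python
from itertools import product

# Arguments shared by every call in the cell above.
common = {
    "file_nn_template": "input.nn.template",
    "file_nn": "input.nn",
    "r_cutoff": 12.0,
    "rule": "imbalzano2018",
    "n_pairs": 5,
}

calls = []
for mode, sf_type in product(
    ["center", "shift"], ["radial", "angular_narrow", "angular_wide"]
):
    kwargs = dict(common, type=sf_type, mode=mode)
    # Only the angular functions take `zetas` in the calls above.
    if sf_type != "radial":
        kwargs["zetas"] = [1]
    calls.append(kwargs)

print(len(calls))
```

Each dictionary could then be passed on as `controller.write_n2p2_nn(**kwargs)`, reproducing the order of the explicit calls.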

### Scale, normalise and prune
Before training, the input data can optionally be normalised. This writes headers to the relevant n2p2 files, but the other values in `input.data` remain unchanged. Additionally, the symmetry functions must be "scaled", and they can also be "pruned" to make the training process less expensive. Symmetry functions whose values span only a small range across `input.data` are deemed less useful than those that vary a lot, and are commented out of `input.nn`.
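
The range-based pruning criterion can be illustrated with a small, self-contained sketch. The function `select_symmetry_functions`, the data layout and the values are hypothetical, and whether n2p2 treats a range exactly equal to the threshold as kept or pruned is an assumption made here.

```python
def select_symmetry_functions(values_by_function, range_threshold):
    """Split symmetry functions into (kept, pruned) by their value range.

    `values_by_function` maps an illustrative name to the values that the
    symmetry function takes across the dataset.
    """
    kept, pruned = [], []
    for name, values in values_by_function.items():
        value_range = max(values) - min(values)
        # Assumption: functions with a range below the threshold are pruned.
        if value_range >= range_threshold:
            kept.append(name)
        else:
            pruned.append(name)
    return kept, pruned


# Made-up values for three symmetry functions over three structures.
values_by_function = {
    "radial_1": [0.10, 0.35, 0.80],
    "radial_2": [0.50000, 0.50002, 0.50001],
    "angular_1": [0.00, 0.02, 0.01],
}
kept, pruned = select_symmetry_functions(values_by_function, range_threshold=1e-4)
print(kept, pruned)
```

Raising the threshold prunes more functions and lowering it prunes fewer, which is the knob exposed as `range_threshold` in the script generation below.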

Both the script for these pre-training steps and the script for the training itself are generated by a single function that takes many optional arguments.

In [None]:
controller.write_n2p2_scripts(range_threshold=1e-4, ntasks_per_node=1, constraint=slurm_constraint)

The preparation required before training can then be run as follows:

In [None]:
!sbatch n2p2_prepare.sh

### Train network
Provided that an acceptable number of symmetry functions remains after pruning, the network can now be trained. If not, re-run the preparation with a higher or lower threshold.

In [None]:
!sbatch n2p2_train.sh

### Weights selection
Once training is finished (either by completing all epochs or reaching the time limit) a set of weights should be chosen. Either a specific epoch can be chosen, or an epoch can be automatically chosen in order to minimise one of the errors calculated as a metric during the training process. If both `epoch` and `minimum_criterion` are `None` then the most recent epoch will be chosen by default.

Note that since multiple networks are required for the active learning workflows, the index of the directory in question should also be specified.

In [None]:
epoch = None
minimum_criterion = None
# minimum_criterion = "RMSEpa_Etest_pu"
# minimum_criterion = "MAEpa_Etest_pu"
# minimum_criterion = "RMSE_Ftest_pu"
# minimum_criterion = "MAE_Ftest_pu"
file_out = "weights.{0:03d}.data"
controller.choose_weights(
    n2p2_directory_index=0,
    epoch=epoch,
    minimum_criterion=minimum_criterion,
    file_out=file_out,
)
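
The selection rules described above can be sketched as follows. This is not the library's implementation: `choose_epoch`, the `learning_curve` layout and the error values are hypothetical, and giving an explicit `epoch` precedence over `minimum_criterion` is a choice made only for this sketch.

```python
def choose_epoch(learning_curve, epoch=None, minimum_criterion=None):
    """Pick an epoch from a {epoch: {metric: value}} mapping."""
    if epoch is not None:
        # An explicitly requested epoch is used as given.
        return epoch
    if minimum_criterion is not None:
        # Otherwise pick the epoch with the smallest value of the metric.
        return min(learning_curve, key=lambda e: learning_curve[e][minimum_criterion])
    # With neither argument set, fall back to the most recent epoch.
    return max(learning_curve)


# Made-up error metrics for three epochs.
learning_curve = {
    1: {"RMSEpa_Etest_pu": 0.030, "RMSE_Ftest_pu": 0.40},
    2: {"RMSEpa_Etest_pu": 0.012, "RMSE_Ftest_pu": 0.31},
    3: {"RMSEpa_Etest_pu": 0.015, "RMSE_Ftest_pu": 0.29},
}
latest = choose_epoch(learning_curve)
best_energy = choose_epoch(learning_curve, minimum_criterion="RMSEpa_Etest_pu")
best_force = choose_epoch(learning_curve, minimum_criterion="RMSE_Ftest_pu")

# The `{0:03d}` placeholder in `file_out` zero-pads an integer to three digits.
# n2p2 names weight files by atomic number; that the placeholder is filled
# with the atomic number here is an assumption.
oxygen_weights = "weights.{0:03d}.data".format(8)
print(latest, best_energy, best_force, oxygen_weights)
```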

## 3. LAMMPS Validation

Once the network is trained it can be used in LAMMPS to run MD simulations. An existing `.xyz` file can be used with the `write_lammps_data` function, or a `Dataset` object can be written in the `"lammps-data"` format.

In [None]:
controller.write_lammps_data(file_xyz='xyz/0.xyz', lammps_unit_style='metal')

In [None]:
from cc_hdnnp.dataset import Dataset
dataset = Dataset(data_file=f"{main_directory}/{n2p2_sub_directories[0]}/input.data")
dataset.write(
    file_out=f"{main_directory}/{lammps_sub_directory}/lammps.data",
    format="lammps-data",
    conditions=(i == 0 for i in range(len(dataset))),
)
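
In the cell above, `conditions` is a generator that yields `True` only for the first frame. Assuming it is consumed as one boolean per frame (an assumption about `Dataset.write`, not confirmed here), its effect can be pictured with plain Python:

```python
# Stand-ins for the frames of a dataset.
frames = ["frame_0", "frame_1", "frame_2"]

# One boolean per frame: True only for index 0, as in the cell above.
conditions = (i == 0 for i in range(len(frames)))

# Keep only the frames whose condition is True.
selected = [frame for frame, keep in zip(frames, conditions) if keep]
print(selected)
```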

### Extrapolations
N2P2 automatically produces warnings when the network is extrapolating out of the range it was trained with, and will abort the MD if enough are produced. To see how many of these are produced in different conditions, scripts for a range of ensembles and temperatures can be produced, run, and analysed:

In [None]:
controller.write_extrapolations_lammps_script(
    n2p2_directory_index=0, temperatures=range(290, 310), constraint=slurm_constraint,
)

In [None]:
!sbatch lammps_extrapolations.sh

In [None]:
controller.analyse_extrapolations(temperatures=range(290, 310))

### RDF Validation
The dump files generated by the extrapolation tests (or otherwise) can also have their RDF compared to that of the original trajectory used in dataset generation. This first requires conversion into pdb and xyz formats for use with the external aml package.

In [None]:
from ase.io import write, read
from aml.score.rdf import run_rdf_test
from aml.score import load_with_cell

filepath_ref = "reference.history"
filepath_net = f"../{lammps_sub_directory}/nve_t290_xyz_dump.lammpstrj"

controller.read_trajectory(filepath_ref)
write("../validation/ref_pos.xyz", controller.trajectory)
write("../validation/ref.pdb", controller.trajectory)

lammps_dump = read(filepath_net, format="lammps-dump-text", index=":")
write("../validation/net_pos.xyz", lammps_dump)
write("../validation/net.pdb", lammps_dump)

traj = load_with_cell("../validation/ref_pos.xyz", top="../validation/ref.pdb")
traj_net = load_with_cell("../validation/net_pos.xyz", top="../validation/net.pdb")
run_rdf_test(traj, traj_net)

## 4. Active Learning
It is likely that the initial reference structures/energies used for training do not fully describe the system. By training a second network on the same data, active learning can be used to extend the reference structures and energies in regions where the two networks do not agree. Assuming there are two such networks in the directories `../n2p2_1` and `../n2p2_2`, the first step is to generate the necessary LAMMPS input files:

In [None]:
from cc_hdnnp.active_learning import ActiveLearning

active_learning_sub_directory = 'active_learning'

al_controller = Controller(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable,
    n2p2_sub_directories=["n2p2_1", "n2p2_2"],
    n2p2_module_commands=n2p2_module_commands,
    active_learning_sub_directory=active_learning_sub_directory,
)

a = ActiveLearning(data_controller=al_controller)
a.write_lammps(temperatures=[300])

Then run LAMMPS using the appropriate batch script:

In [None]:
!sbatch active_learning_lammps.sh

The trajectories generated by LAMMPS are pre-analysed and, where appropriate, reduced, before the new configurations to be considered are written to file:

In [None]:
a.prepare_lammps_trajectory()
a.prepare_data_new(constraint=slurm_constraint, ntasks_per_node=1)

Then run the NNs using the appropriate batch script to evaluate the energies for this data:

In [None]:
!sbatch active_learning_nn.sh

Using the energy evaluations of the NNs, the configurations to add to the training set can be determined by:

In [None]:
a.prepare_data_add()
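
The idea of selecting configurations where the two networks disagree can be sketched in a simplified form. The function `select_disagreements` and the energies are hypothetical, and only an energy criterion is shown; the actual selection may also consider forces and extrapolation.

```python
def select_disagreements(energies_1, energies_2, delta_E):
    """Return indices of configurations whose two energy predictions differ
    by more than `delta_E` (same units as the energies)."""
    return [
        i
        for i, (e1, e2) in enumerate(zip(energies_1, energies_2))
        if abs(e1 - e2) > delta_E
    ]


# Made-up per-atom energies predicted by two networks for three configurations.
energies_1 = [-1.00000, -1.00200, -0.99950]
energies_2 = [-1.00005, -1.00150, -0.99951]

to_add = select_disagreements(energies_1, energies_2, delta_E=1e-4)
print(to_add)
```

Only the second configuration differs by more than the tolerance, so it is the only candidate for a new reference calculation in this example.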

This generates the file `input.data-add` in the active learning directory. However, reference energies still need to be generated, as so far only the NNs have been evaluated. This is done in the same manner as in section 1, but first requires converting the data into the xyz format (the exact method depends on whether [Quantum Espresso](quantum_espresso.ipynb) or [CP2K](cp2k.ipynb) was used).

For CP2K this should result in multiple files, with each filename containing a frame index:

In [None]:
al_controller.convert_active_learning_to_xyz('input.data-add', f"{active_learning_sub_directory}/xyz/{{}}.xyz")

For QE, a single file should be written, with a name that contains the name of the `Structure` and indications of the temperature and pressure:

In [None]:
al_controller.convert_active_learning_to_xyz(
    file_n2p2_data='input.data-add',
    file_xyz=f"{active_learning_sub_directory}/xyz/mcresol-T300-p1.xyz",
    single_output=True
)

If the `input.data` file already exists, this will append the active learning structures to it. Training can then be restarted with a wider selection of data to give a more widely applicable model. Note, however, that the scaling/normalisation process will need to be redone. To remove the outdated normalisation header:

In [None]:
al_controller.remove_n2p2_normalisation()

## 5. Dataset Manipulation
Following the active learning, it may be that the increased dataset is no longer practical to use due to some outlying values or simply by being too large to fit into memory. There are a few different methods of reducing its size.

Firstly, structures with neighbouring atoms within a specified minimum separation can be removed. This is done during the active learning process, but can also be done after the fact if a higher threshold is desired:

In [None]:
from cc_hdnnp.dataset import Dataset

for species in structure.all_species:
    species.min_separation = {"H": 0.9, "C": 0.9, "O": 0.9}

dataset = Dataset(
    data_file=f"../{active_learning_sub_directory}/mode2/HDNNP_1/input.data",
    all_structures=AllStructures(structure),
)
dataset.write(
    file_out=f"../{active_learning_sub_directory}/mode2/HDNNP_1/input.data.nearest_neighbours",
    conditions=dataset.check_min_separation_all()
)
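
The minimum separation criterion can be sketched for a single frame as follows. The function `check_min_separation` is hypothetical and, unlike a real periodic system, this sketch ignores periodic boundary conditions.

```python
from itertools import combinations
from math import dist


def check_min_separation(symbols, positions, min_separation):
    """Return True if every pair of atoms is at least as far apart as the
    minimum separation defined for that pair of elements."""
    atoms = list(zip(symbols, positions))
    for (s1, p1), (s2, p2) in combinations(atoms, 2):
        if dist(p1, p2) < min_separation[s1][s2]:
            return False
    return True


# Same threshold for every pair, mirroring the 0.9 used in the cell above.
elements = ["H", "C", "O"]
min_separation = {a: {b: 0.9 for b in elements} for a in elements}

symbols = ["O", "H", "H"]
relaxed = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
compressed = [(0.0, 0.0, 0.0), (0.85, 0.0, 0.0), (-0.24, 0.93, 0.0)]

keep_relaxed = check_min_separation(symbols, relaxed, min_separation)
keep_compressed = check_min_separation(symbols, compressed, min_separation)
print(keep_relaxed, keep_compressed)
```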

Secondly, a threshold in energy and/or force values can be set. Care should be taken over the units used here: both `energy_threshold` and `force_threshold` should be in the same units as those expressed in the `Dataset`. Also, either a single float or a tuple of floats can be given for `energy_threshold`. The former is taken as `(-energy_threshold, energy_threshold)` and so is only suitable when using normalised units with a mean of 0. As forces are always expected to have a symmetric distribution about zero, only a single float is supported.

In [None]:
dataset.write(
    file_out=f"../{active_learning_sub_directory}/mode2/HDNNP_1/input.data.outliers",
    conditions=dataset.check_threshold_all(
        energy_threshold=(-1150, -1100), force_threshold=1
    )
)
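
The threshold semantics described above can be sketched for a single frame. The function `within_thresholds` is hypothetical, and applying the force threshold to each force component (rather than to the force magnitude) is an assumption of this sketch.

```python
def within_thresholds(energy, forces, energy_threshold, force_threshold):
    """Return True if the frame's energy and forces lie within the thresholds."""
    # A single number is interpreted as the symmetric interval (-t, t).
    if isinstance(energy_threshold, (int, float)):
        energy_threshold = (-energy_threshold, energy_threshold)
    lower, upper = energy_threshold
    if not lower <= energy <= upper:
        return False
    # Assumption: every force component must lie within +/- force_threshold.
    return all(
        abs(component) <= force_threshold
        for force in forces
        for component in force
    )


forces = [(0.1, -0.2, 0.3), (-0.4, 0.0, 0.5)]
inside = within_thresholds(-1120.0, forces, (-1150, -1100), 1)
outside_energy = within_thresholds(-1090.0, forces, (-1150, -1100), 1)
outside_force = within_thresholds(-1120.0, [(1.5, 0.0, 0.0)], (-1150, -1100), 1)
# A single float only makes sense for energies centred on zero.
symmetric = within_thresholds(0.5, forces, 1.0, 1)
print(inside, outside_energy, outside_force, symmetric)
```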

More complicated methods of pruning the dataset can be found in [Data Pruning](data_pruning.ipynb).