# Example Workflow

## Assumed directory structure

```
example_directory
├── active_learning
│   └── simulation.lammps
├── cp2k_input
│   └── example_template.inp
├── cp2k_output
├── lammps
│   └── template.lmp
├── n2p2
│   └── input.nn.template
├── scripts
│   ├── example.ipynb
│   └── template.sh
├── xyz
└── example_trajectory.history
```

Most functions accept explicit file paths, but their default arguments assume the directory structure above and read from and write to the corresponding locations.
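
As a convenience, the layout can be created up front. The following is a minimal sketch using only the standard library; the helper name `make_skeleton` is hypothetical and not part of `cc_hdnnp`, and it creates the sub-directories only (template files still need to be supplied by hand).

```python
import tempfile
from pathlib import Path

# Sub-directories taken from the tree above; the files inside them are
# templates or outputs and are not created here.
SUB_DIRECTORIES = (
    "active_learning",
    "cp2k_input",
    "cp2k_output",
    "lammps",
    "n2p2",
    "scripts",
    "xyz",
)


def make_skeleton(main_directory):
    """Create the assumed sub-directories under `main_directory` (hypothetical helper)."""
    root = Path(main_directory)
    for name in SUB_DIRECTORIES:
        (root / name).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())


# A temporary directory is used so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp:
    created = make_skeleton(Path(tmp) / "example_directory")
    print(created)
```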

File names are also handled differently depending on whether one file or many are expected. Only a single trajectory is expected, so it is given as a full file name (e.g. `'example_trajectory.history'`). Individual frames are written to many files with a regular naming pattern, so their file name should contain a pair of braces to allow formatting (e.g. `'xyz/{}.xyz'`).
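
The brace convention is ordinary Python string formatting. A short illustration, assuming frames are numbered from zero (as `xyz/0.xyz` in step 9 suggests):

```python
file_trajectory = "example_trajectory.history"  # single file: no placeholder
file_xyz = "xyz/{}.xyz"                         # one file per frame

# The braces are filled in with the frame index
frame_paths = [file_xyz.format(i) for i in range(3)]
print(frame_paths)  # ['xyz/0.xyz', 'xyz/1.xyz', 'xyz/2.xyz']
```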

Finally, the code relies on "template" files, which hold settings that do not need to change between frames. To change these settings, modify the template files directly.

The majority of file management commands are called via the `Data` object. This stores information about the directory structure, location of executables and the properties of the atoms in question. The latter in turn uses `Species` and `Structure` objects to store information.

In [1]:
from cc_hdnnp.data import Data
from cc_hdnnp.structure import AllSpecies, AllStructures, Species, Structure

# Create objects for all elements in the structure
H = Species(symbol='H', atomic_number=1, mass=1.00794)
C = Species(symbol='C', atomic_number=6, mass=12.011)
O = Species(symbol='O', atomic_number=8, mass=15.9994)

# Define a name for the Structure which has the above constituent elements
# Information used for active learning, such as the energy and force tolerances is also defined here
all_species = AllSpecies(H, C, O)
structure = Structure(name='mcresol', all_species=all_species, delta_E=1e-4, delta_F=1e-2)
all_structures = AllStructures(structure)

main_directory = '../../m_cresol/all_frames'
n2p2_bin = '/home/vol00/scarf860/cc_placement/n2p2/bin'
lammps_executable = '/home/vol00/scarf860/cc_placement/lammps/build/lmp_mpi'

d = Data(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable,
    n2p2_sub_directory='n2p2_AL_only'
)

## 1. Generate atomic configurations
There are no utility scripts for generating configurations. However, a full trajectory can be converted into individual frames with:

In [None]:
d.read_trajectory(file_trajectory='example_trajectory.history')
d.write_xyz(file_xyz='xyz/{}.xyz')
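
For orientation, the sketch below shows what writing one file per frame looks like in the plain XYZ layout (atom count, comment line, then one `symbol x y z` line per atom). It is not the `cc_hdnnp` implementation: `write_xyz_frames` is a hypothetical helper, the coordinates are placeholders, and the files written by `d.write_xyz` may carry extra information such as the lattice.

```python
import tempfile
from pathlib import Path


def write_xyz_frames(frames, file_xyz):
    """Write each frame to its own file; `file_xyz` must contain `{}`."""
    paths = []
    for index, atoms in enumerate(frames):
        path = Path(file_xyz.format(index))
        path.parent.mkdir(parents=True, exist_ok=True)
        # Plain XYZ: number of atoms, a comment line, then the atoms
        lines = [str(len(atoms)), f"frame {index}"]
        lines += [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
        path.write_text("\n".join(lines) + "\n")
        paths.append(path)
    return paths


# Placeholder coordinates for illustration only (not from a real trajectory)
frames = [
    [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0)],
    [("O", 0.0, 0.0, 0.1), ("H", 0.97, 0.0, 0.1)],
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_xyz_frames(frames, str(Path(tmp) / "xyz" / "{}.xyz"))
    names = [p.name for p in written]
    first_line = written[0].read_text().splitlines()[0]
    print(names, first_line)
```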

## 2. Write CP2K
Both batch scripts and input files for CP2K can be generated by: 

In [None]:
# TODO Currently the cp2k commands are not written by the python script, but should be present in
# `file_batch`
d.write_cp2k(file_batch='scripts/cp2k_batch_{}.bash',
             file_input='cp2k_input/example_{}.inp',
             file_xyz='xyz/{}.xyz',
             n_config=1,
             cutoff=(400, 600, 800),
             relcutoff=(40, 60, 80))

In this case, 9 input files and 9 batch scripts are generated, one for each combination of the 3 `cutoff` and 3 `relcutoff` values. The results can then be used to choose the values that best balance accuracy against time taken.
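
The count of 9 is the Cartesian product of the two settings. The sketch below enumerates the combinations; the label pattern is an assumption inferred from the output file names used in step 5 (e.g. `example_n_{}_cutoff_600_relcutoff_60.log`), not taken from the `write_cp2k` source.

```python
from itertools import product

cutoff = (400, 600, 800)
relcutoff = (40, 60, 80)
n_config = 1

# Assumed label pattern: n_<frame>_cutoff_<cutoff>_relcutoff_<relcutoff>
labels = [
    f"n_{n}_cutoff_{c}_relcutoff_{r}"
    for n, c, r in product(range(n_config), cutoff, relcutoff)
]
input_files = ["cp2k_input/example_{}.inp".format(label) for label in labels]

print(len(input_files))  # 9
print(input_files[0])    # cp2k_input/example_n_0_cutoff_400_relcutoff_40.inp
```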

## 3. Run CP2K
The previous step should output the command `bash ../scripts/all.bash`. This script submits every batch script, creating one Slurm job for each of the 9 cases.

In [None]:
!bash ../scripts/all.bash


## 4. Choose (rel)cutoff
To extract the useful information from the CP2K output, the following function will print a table comparing energy, time taken and grid allocation:

In [None]:
d.print_cp2k_table(n_config=1, cutoff=(400, 600, 800), relcutoff=(40, 60, 80))
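
One common way to read such a table is to take the cheapest setting whose energy already agrees with the most expensive one to within a tolerance. The sketch below illustrates that rule only; `first_converged` is a hypothetical helper, and the energies and tolerance are placeholders rather than CP2K results.

```python
def first_converged(values, energies, tolerance):
    """Return the smallest setting whose energy is within `tolerance`
    of the energy at the largest setting (settings sorted ascending)."""
    reference = energies[-1]
    for value, energy in zip(values, energies):
        if abs(energy - reference) <= tolerance:
            return value
    return values[-1]


cutoff = (400, 600, 800)
# Placeholder energies for illustration only, not real CP2K output
energies = (-100.0012, -100.0001, -100.0000)

chosen = first_converged(cutoff, energies, tolerance=5e-4)
print(chosen)  # 600
```

In practice the time taken per calculation matters as well, so the table is best read with both columns in mind.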

Once the best values are chosen, repeat steps 2 and 3 with a larger `n_config` to run CP2K on more frames:

In [None]:
d.write_cp2k(file_batch='scripts/cp2k_batch_{}.bat',
             file_input='cp2k_input/example_{}.inp',
             file_xyz='xyz/{}.xyz',
             n_config=101,
             cutoff=(600,),
             relcutoff=(60,))

In [None]:
!bash ../scripts/all.bash
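
A Python detail worth keeping in mind when passing a single `cutoff` or `relcutoff`: parentheses alone do not make a tuple. Whether `write_cp2k` also accepts a bare integer is not stated here, so a one-element tuple is the safer way to match the form used in step 2.

```python
single_value = (600)   # parentheses only group: this is the int 600
one_tuple = (600,)     # the trailing comma makes a one-element tuple

print(type(single_value).__name__)  # int
print(type(one_tuple).__name__)     # tuple
```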

## 5. Write N2P2
Once force and energy values are obtained from CP2K, these can be written to the N2P2 data format. The structure name should match one of the structures in `all_structures`:

In [None]:
d.write_n2p2_data(
    structure_name="mcresol",
    file_cp2k_out='cp2k_output/example_n_{}_cutoff_600_relcutoff_60.log',
    file_cp2k_forces='cp2k_output/example_n_{}_cutoff_600_relcutoff_60-forces-1_0.xyz',
    file_xyz='xyz/{}.xyz',
    file_n2p2_input='input.data',
    n_config=101)
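
For reference, each configuration in an n2p2 data file is a `begin` ... `end` block containing `lattice`, `atom`, `energy` and `charge` lines. The sketch below formats one such block; it follows the layout described in the n2p2 documentation as best understood here, is not the `write_n2p2_data` implementation, and uses placeholder numbers and a hypothetical helper name.

```python
def format_n2p2_frame(lattice, atoms, energy, comment=""):
    """Return one configuration as text in the n2p2 `input.data` layout.

    lattice: three lattice vectors, each (x, y, z)
    atoms:   iterable of (symbol, (x, y, z), (fx, fy, fz))
    """
    lines = ["begin", f"comment {comment}"]
    for vector in lattice:
        lines.append("lattice {:.6f} {:.6f} {:.6f}".format(*vector))
    for symbol, position, force in atoms:
        # Columns: x y z element charge unused fx fy fz
        # (atomic charge and the unused column are set to zero here)
        lines.append(
            "atom {:.6f} {:.6f} {:.6f} {} 0.0 0.0 {:.6f} {:.6f} {:.6f}".format(
                *position, symbol, *force
            )
        )
    lines.append(f"energy {energy:.6f}")
    lines.append("charge 0.0")
    lines.append("end")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only
text = format_n2p2_frame(
    lattice=[(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)],
    atoms=[
        ("O", (0.0, 0.0, 0.0), (0.01, 0.0, 0.0)),
        ("H", (1.8, 0.0, 0.0), (-0.01, 0.0, 0.0)),
    ],
    energy=-17.0,
    comment="frame 0",
)
print(text)
```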

Multiple symmetry functions can be written to the same network input file, for example both centered and shifted versions of the radial, narrow angular and wide angular functions:

In [None]:
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='radial',
                rule='imbalzano2018',
                mode='center',
                n_pairs=5)
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='angular_narrow',
                rule='imbalzano2018',
                mode='center',
                n_pairs=5,
                zetas=[1])
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='angular_wide',
                rule='imbalzano2018',
                mode='center',
                n_pairs=5,
                zetas=[1])
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='radial',
                rule='imbalzano2018',
                mode='shift',
                n_pairs=5)
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='angular_narrow',
                rule='imbalzano2018',
                mode='shift',
                n_pairs=5,
                zetas=[1])
d.write_n2p2_nn(file_template='input.nn.template',
                file_nn='input.nn',
                r_cutoff=12.0,
                type='angular_wide',
                rule='imbalzano2018',
                mode='shift',
                n_pairs=5,
                zetas=[1])
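
The six calls above differ only in `type` and `mode`, so the argument sets can also be generated in a loop. This is a sketch of that idea: it builds the keyword arguments without calling `cc_hdnnp`, and the commented line shows where the `Data` object `d` from earlier would be used.

```python
from itertools import product

# Arguments shared by all six calls above
common = dict(
    file_template="input.nn.template",
    file_nn="input.nn",
    r_cutoff=12.0,
    rule="imbalzano2018",
    n_pairs=5,
)

calls = []
for mode, sf_type in product(
    ("center", "shift"), ("radial", "angular_narrow", "angular_wide")
):
    kwargs = dict(common, type=sf_type, mode=mode)
    if sf_type != "radial":
        kwargs["zetas"] = [1]  # only the angular calls above pass `zetas`
    calls.append(kwargs)

# With the `Data` object `d` from earlier, each entry would be applied as:
#     d.write_n2p2_nn(**kwargs)
print(len(calls))  # 6
```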

## 6. Scale and prune symmetry functions
Before training, the input data can optionally be normalised. This adds headers to the relevant n2p2 files, but leaves the other values in `input.data` unchanged. Additionally, the symmetry functions must be "scaled", and to make training less expensive they can also be "pruned": those with a small range across `input.data` are deemed less useful than those that vary a lot, and are commented out of `input.nn`.
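
The pruning criterion can be pictured as a simple range test. The sketch below is an illustration of that idea, not the n2p2 pruning tool itself: `prune_by_range` is a hypothetical helper, the values are placeholders, and whether the real comparison is strict or inclusive at the threshold is an assumption.

```python
def prune_by_range(values_per_function, range_threshold):
    """Return the indices of symmetry functions whose range (max - min)
    over the dataset is at least `range_threshold`."""
    kept = []
    for index, values in enumerate(values_per_function):
        if max(values) - min(values) >= range_threshold:
            kept.append(index)
    return kept


# Placeholder symmetry function values, one list per symmetry function
values_per_function = [
    [0.10, 0.35, 0.22],           # varies across the data: kept
    [0.50000, 0.50001, 0.50000],  # range of about 1e-5: pruned
    [0.00, 0.80, 0.40],           # varies across the data: kept
]

kept = prune_by_range(values_per_function, range_threshold=1e-4)
print(kept)  # [0, 2]
```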

## 7. Train network
Provided an acceptable number of symmetry functions remain after pruning, the network can now be trained. If not, step 6 can be re-run with a higher or lower threshold.

The batch scripts for steps 6 and 7 are generated by the following:


In [None]:
d.write_n2p2_scripts(range_threshold=1e-4)

In [None]:
!sbatch ../scripts/n2p2_prune.bat
!sbatch ../scripts/n2p2_train.bat

The most recent weights (those from the last epoch) are copied and renamed to the format `weights.<atomic_number>.data`. If for whatever reason a different epoch is desired, then the files should be renamed manually.
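
Selecting a different epoch by hand amounts to copying the per-epoch weight files over the `.data` files. The sketch below assumes n2p2's usual naming, `weights.<ZZZ>.<epoch>.out` with a zero-padded three-digit atomic number and six-digit epoch; check the training directory before relying on this pattern. `select_epoch` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def select_epoch(n2p2_directory, atomic_numbers, epoch):
    """Copy `weights.<Z>.<epoch>.out` to `weights.<Z>.data` for each element."""
    directory = Path(n2p2_directory)
    copied = []
    for z in atomic_numbers:
        source = directory / f"weights.{z:03d}.{epoch:06d}.out"
        target = directory / f"weights.{z:03d}.data"
        shutil.copyfile(source, target)
        copied.append(target.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    # Empty stand-in files so the sketch runs without a real training run
    for z in (1, 6, 8):
        (Path(tmp) / f"weights.{z:03d}.000005.out").write_text("")
    copied = select_epoch(tmp, atomic_numbers=(1, 6, 8), epoch=5)
    print(copied)
```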

## 8. Active Learning
It is likely that the initial reference structures and energies used for training do not fully describe the system. By training a second network on the same data, active learning can be used to extend the reference set in regions where the two networks disagree. Assuming two such networks exist in the directories `../n2p2_1` and `../n2p2_2`, the first step is to generate the necessary LAMMPS input files:

In [None]:
from active_learning import ActiveLearning
a = ActiveLearning(data_controller=d, n2p2_directories=['../n2p2_1', '../n2p2_2'])
a.write_lammps()

Then run LAMMPS using the appropriate batch script:

In [None]:
!sbatch ../scripts/active_learning_lammps.sh

The trajectories generated by LAMMPS are pre-analysed and, where appropriate, reduced before the candidate configurations are written to file:

In [None]:
a.prepare_lammps_trajectory()
a.prepare_data_new()

Then run the NNs using the appropriate batch script to evaluate the energies for this data:

In [None]:
!sbatch ../scripts/active_learning_nn.sh

Using the energy evaluations of the NNs, the configurations to add to the training set can be determined by:

In [None]:
a.prepare_data_add()
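
Conceptually, a configuration is worth adding when the two networks disagree by more than the tolerance set on the `Structure` (`delta_E=1e-4` above). The sketch below shows only that energy comparison; it is a simplification rather than the `ActiveLearning` logic, which presumably also uses `delta_F` for forces. The helper name and energies are placeholders.

```python
def select_disagreements(energies_1, energies_2, delta_E):
    """Return indices of configurations where the two networks'
    energies differ by more than `delta_E`."""
    return [
        index
        for index, (e1, e2) in enumerate(zip(energies_1, energies_2))
        if abs(e1 - e2) > delta_E
    ]


# Placeholder predictions from the two networks, for illustration only
energies_1 = [-1.00000, -1.0500, -0.9800]
energies_2 = [-1.00005, -1.0450, -0.9800]

selected = select_disagreements(energies_1, energies_2, delta_E=1e-4)
print(selected)  # [1]
```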

If the configurations in `input.data-add` seem reasonable, they can be added to the existing data in the n2p2 folders with:

In [None]:
a.combine_data_add()

Training can then be restarted on the wider selection of data to give a more broadly applicable model.

## 9. Write LAMMPS
To set up LAMMPS with data from an existing `.xyz` file, the `write_lammps_data` function can be used. The interaction is defined by `write_lammps_pair`, which creates a LAMMPS input file based on the template provided:

In [None]:
d.write_lammps_data(file_xyz='xyz/0.xyz', lammps_unit_style='metal')
d.write_lammps_pair(r_cutoff=6.351,
                    file_template='lammps/template.lmp',
                    file_out='lammps/md.lmp',
                    n2p2_directory='n2p2',
                    lammps_unit_style='metal')
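
The value `r_cutoff=6.351` appears to be the symmetry function cutoff from step 5 expressed in LAMMPS `metal` units, where distances are in Ångström. The check below assumes the network cutoff of 12.0 is in Bohr, which is an inference from the numbers rather than something stated above.

```python
BOHR_TO_ANGSTROM = 0.529177210903  # CODATA 2018 Bohr radius in Ångström

r_cutoff_bohr = 12.0  # value passed to `write_n2p2_nn` in step 5 (assumed Bohr)
r_cutoff_angstrom = r_cutoff_bohr * BOHR_TO_ANGSTROM

print(round(r_cutoff_angstrom, 4))  # 6.3501
```

Under that assumption, 6.351 is the converted cutoff rounded up slightly, which keeps the LAMMPS cutoff from being smaller than the one the network was trained with.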

## 10. Run LAMMPS
Finally, LAMMPS can be run with the neural network potential defining the interactions.