# Dataset Generation (CP2K)

This is an example workflow that uses CP2K to generate the training data, covered in sections 1 to 5. The alternative method, using Quantum ESPRESSO, can be found [here](quantum_espresso.ipynb).

## Assumed directory structure

```
example_directory
├── active_learning
│   ├── xyz
│   └── simulation.lammps
├── cp2k_input
│   └── template.inp
├── cp2k_output
├── lammps
│   └── template.lmp
├── n2p2
│   └── input.nn.template
├── qe
│   ├── pseuodos
│   │   └── ...
│   └── mcresol-T300-p1.xyz
├── scripts
│   ├── cp2k.ipynb
│   ├── data_pruning.ipynb
│   ├── quantum_espresso.ipynb
│   ├── workflow.ipynb
│   └── visualise.ipynb
├── validation
├── xyz
└── reference.history
```

In [None]:
# Executables and filepaths
main_directory = '..'
n2p2_bin = '/path/to/n2p2/bin'
lammps_executable = '/path/to/lammps/build/lmp_mpi'
basis_set_directory = "/path/to/cp2k/data/BASIS_MOLOPT"
potential_directory = "/path/to/cp2k/data/GTH_POTENTIALS"
cp2k_module_commands = [
    'module use ...',
    'module load ...',
]
slurm_constraint = "constraint"

basis_set_dict = {
    "H": "DZVP-MOLOPT-SR-GTH-q1",
    "C": "DZVP-MOLOPT-SR-GTH-q4",
    "O": "DZVP-MOLOPT-SR-GTH-q6",
}
potential_dict = {
    "H": "GTH-HCTH407-q1",
    "C": "GTH-HCTH407-q4",
    "O": "GTH-HCTH407-q6",
}
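
For orientation, these two dictionaries correspond to the per-element `&KIND` sections of a CP2K input file, where `BASIS_SET` and `POTENTIAL` are set for each species. The sketch below renders such sections from the dictionaries. `render_kind_sections` is a hypothetical helper written for illustration; it is not part of `cc_hdnnp`, which may build its input files differently.

```python
# Hypothetical helper (not part of cc_hdnnp): render CP2K &KIND sections
# from per-element basis set and potential names.
basis_set_dict = {
    "H": "DZVP-MOLOPT-SR-GTH-q1",
    "C": "DZVP-MOLOPT-SR-GTH-q4",
    "O": "DZVP-MOLOPT-SR-GTH-q6",
}
potential_dict = {
    "H": "GTH-HCTH407-q1",
    "C": "GTH-HCTH407-q4",
    "O": "GTH-HCTH407-q6",
}


def render_kind_sections(basis_sets, potentials):
    """Return one &KIND block per element as a single string."""
    mismatched = set(basis_sets) ^ set(potentials)
    if mismatched:
        raise ValueError(f"Elements missing from one dictionary: {sorted(mismatched)}")
    lines = []
    for element in sorted(basis_sets):
        lines += [
            f"&KIND {element}",
            f"  BASIS_SET {basis_sets[element]}",
            f"  POTENTIAL {potentials[element]}",
            "&END KIND",
        ]
    return "\n".join(lines)


kind_text = render_kind_sections(basis_set_dict, potential_dict)
print(kind_text)
```

Checking that both dictionaries cover the same elements up front catches a missing entry before any job is submitted.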

In [None]:
from cc_hdnnp.controller import Controller
from cc_hdnnp.structure import AllStructures, Species, Structure

# Create objects for all elements in the structure
H = Species(symbol='H', atomic_number=1, mass=1.00794)
C = Species(symbol='C', atomic_number=6, mass=12.011)
O = Species(symbol='O', atomic_number=8, mass=15.9994)

# Define a name for the Structure which has the above constituent elements.
# Information used for active learning, such as the energy and force tolerances, is also defined here.
all_species = [H, C, O]
structure = Structure(
    name='mcresol', all_species=all_species, delta_E=1e-4, delta_F=1e-2
)
all_structures = AllStructures(structure)

controller = Controller(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable,
    cp2k_module_commands=cp2k_module_commands,
)

## 1. Generate atomic configurations
There are no utility scripts for generating configurations. However, a full trajectory can be split into individual frames with:

In [None]:
controller.read_trajectory(file_trajectory='reference.history')
controller.write_xyz(file_xyz='xyz/{}.xyz')
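
To illustrate how a template such as `'xyz/{}.xyz'` can map onto one file per frame, here is a minimal sketch that writes already-parsed frames in plain XYZ format. It assumes zero-based frame indices and plain (not extended) XYZ; both are assumptions, and `frame_to_xyz` and `write_frames` are hypothetical helpers rather than `cc_hdnnp` functions. The coordinates are made-up placeholders.

```python
import tempfile
from pathlib import Path


def frame_to_xyz(atoms, comment=""):
    """Format one frame as plain XYZ: atom count, comment line, one line per atom."""
    lines = [str(len(atoms)), comment]
    for symbol, x, y, z in atoms:
        lines.append(f"{symbol} {x:.6f} {y:.6f} {z:.6f}")
    return "\n".join(lines) + "\n"


def write_frames(frames, file_xyz):
    """Write each frame to file_xyz.format(index) and return the paths written."""
    paths = []
    for index, atoms in enumerate(frames):
        path = Path(file_xyz.format(index))
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(frame_to_xyz(atoms, comment=f"frame {index}"))
        paths.append(path)
    return paths


# Two placeholder frames with invented coordinates, for illustration only.
frames = [
    [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0)],
    [("O", 0.0, 0.0, 0.1), ("H", 0.97, 0.0, 0.1)],
]

# Write into a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    written = write_frames(frames, str(Path(tmp) / "xyz" / "{}.xyz"))
    names = [path.name for path in written]
    first = written[0].read_text()

print(names)
```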

## 2. Write CP2K
Both the batch scripts and the input files for CP2K can be generated with:

In [None]:
controller.write_cp2k(
    structure_name="mcresol",
    basis_set_directory=basis_set_directory,
    potential_directory=potential_directory,
    basis_set_dict=basis_set_dict,
    potential_dict=potential_dict,
    file_xyz='xyz/{}.xyz',
    n_config=1,
    cutoff=(400, 600, 800),
    relcutoff=(40, 60, 80),
    constraint=slurm_constraint,
)

In this case, we have generated 9 input files and 9 batch scripts by specifying 3 values each for `cutoff` and `relcutoff`, one for every combination. These can then be used to determine the values that best balance accuracy against time taken.
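
The count of 9 comes from pairing every `cutoff` with every `relcutoff`. The sketch below enumerates those pairs with `itertools.product` and builds the matching log file names, using the naming pattern from the `file_cp2k_out` argument in section 5. Using 0 as the frame index is an assumption.

```python
from itertools import product

cutoff = (400, 600, 800)
relcutoff = (40, 60, 80)

# Every (cutoff, relcutoff) pair: 3 x 3 = 9 combinations.
combinations = list(product(cutoff, relcutoff))

# Name pattern taken from the `file_cp2k_out` argument in section 5.
# The frame index 0 is an assumption made for this illustration.
log_template = "cp2k_output/mcresol_n_{n}_cutoff_{cutoff}_relcutoff_{relcutoff}.log"
log_files = [
    log_template.format(n=0, cutoff=c, relcutoff=r) for c, r in combinations
]

print(len(combinations))
for name in log_files:
    print(name)
```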

## 3. Run CP2K
The previous step should print the command `bash ../scripts/all.sh`. That bash script submits every batch script, creating one Slurm job for each of the 9 cases.

In [None]:
!bash all.sh
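
As a rough picture of what a submission script of this kind contains, the sketch below builds one `sbatch` line per batch script. The batch script names are hypothetical placeholders, and `render_submit_script` is not a `cc_hdnnp` function; the real `all.sh` is written by `write_cp2k` and may differ.

```python
def render_submit_script(batch_scripts):
    """Return the text of a bash script that calls sbatch once per batch script."""
    lines = ["#!/bin/bash"]
    lines += [f"sbatch {script}" for script in batch_scripts]
    return "\n".join(lines) + "\n"


# Hypothetical batch script names; the real names are chosen by write_cp2k.
batch_scripts = ["job_1.sh", "job_2.sh", "job_3.sh"]
submit_text = render_submit_script(batch_scripts)
print(submit_text)
```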


## 4. Choose (rel)cutoff
The following function extracts the useful information from the CP2K output and prints a table comparing energy, time taken and grid allocation:

In [None]:
controller.print_cp2k_table(
    structure_name="mcresol",
    n_config=1,
    cutoff=(400, 600, 800),
    relcutoff=(40, 60, 80),
)
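
One possible way to read such a table is to treat the most expensive setting as the reference and pick the fastest setting whose energy agrees with it within a tolerance. The sketch below shows that selection rule. The energies, timings and tolerance are invented placeholders, not CP2K output, and `choose_settings` is a hypothetical helper rather than part of `cc_hdnnp`.

```python
# Placeholder results keyed by (cutoff, relcutoff): (energy in Ha, time in s).
# These numbers are invented for illustration and are NOT CP2K output.
results = {
    (400, 40): (-100.0010, 10.0),
    (400, 60): (-100.0006, 12.0),
    (600, 60): (-100.00005, 20.0),
    (800, 80): (-100.0000, 45.0),
}


def choose_settings(results, tolerance):
    """Pick the fastest setting whose energy is within tolerance of the reference."""
    # Reference: largest cutoff, ties broken by largest relcutoff.
    reference_key = max(results)
    reference_energy = results[reference_key][0]
    converged = {
        key: value
        for key, value in results.items()
        if abs(value[0] - reference_energy) <= tolerance
    }
    # Among the converged settings, prefer the shortest run time.
    return min(converged, key=lambda key: converged[key][1])


best = choose_settings(results, tolerance=1e-4)  # hypothetical tolerance in Ha
print(best)
```

With these placeholder values the rule selects `(600, 60)`: it is the cheapest entry within the tolerance of the `(800, 80)` reference.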

Once the best values are chosen, repeat steps 2 and 3 with the chosen `cutoff` and `relcutoff` and a larger `n_config` to run CP2K on more frames:

In [None]:
controller.write_cp2k(
    structure_name="mcresol",
    basis_set_directory=basis_set_directory,
    potential_directory=potential_directory,
    basis_set_dict=basis_set_dict,
    potential_dict=potential_dict,
    file_xyz='xyz/{}.xyz',
    n_config=1,  # increase this to run CP2K on more frames
    cutoff=600,
    relcutoff=60,
    constraint=slurm_constraint,
)

In [None]:
!bash all.sh

## 5. Write N2P2
Once force and energy values are obtained from CP2K, these can be written to the N2P2 data format. The structure name should match one of the structures in `all_structures`:

In [None]:
controller.write_n2p2_data(
    structure_name="mcresol",
    file_cp2k_out='cp2k_output/mcresol_n_{}_cutoff_600_relcutoff_60.log',
    file_cp2k_forces='cp2k_output/mcresol_n_{}_cutoff_600_relcutoff_60-forces-1_0.xyz',
    file_xyz='xyz/{}.xyz',
    file_n2p2_input='input.data',
    n_config=1,
)
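
For reference, an N2P2 data file stores each configuration between `begin` and `end` keywords, with `lattice`, `atom`, `energy` and `charge` lines in between. The sketch below formats a single frame in that layout. `format_n2p2_frame` is a hypothetical helper, the values are made-up placeholders, and no unit conversion is performed; `write_n2p2_data` may handle units and formatting differently.

```python
def format_n2p2_frame(lattice, atoms, energy, charge=0.0, comment=""):
    """Format one configuration as a begin/end block in the N2P2 data layout."""
    lines = ["begin"]
    if comment:
        lines.append(f"comment {comment}")
    for vector in lattice:
        lines.append("lattice {:.6f} {:.6f} {:.6f}".format(*vector))
    for symbol, position, force in atoms:
        x, y, z = position
        fx, fy, fz = force
        # Columns: position, element, atomic charge, unused value, force.
        lines.append(
            f"atom {x:.6f} {y:.6f} {z:.6f} {symbol} 0.0 0.0 "
            f"{fx:.6f} {fy:.6f} {fz:.6f}"
        )
    lines.append(f"energy {energy:.6f}")
    lines.append(f"charge {charge:.6f}")
    lines.append("end")
    return "\n".join(lines) + "\n"


# Placeholder frame with invented values, for illustration only.
lattice = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0), (0.0, 0.0, 10.0)]
atoms = [
    ("O", (0.0, 0.0, 0.0), (0.01, 0.0, 0.0)),
    ("H", (1.8, 0.0, 0.0), (-0.01, 0.0, 0.0)),
]
frame_text = format_n2p2_frame(lattice, atoms, energy=-1.0, comment="placeholder frame")
print(frame_text)
```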