# Analysis

## Assumed directory structure

```
example_directory
├── active_learning
│   ├── xyz
│   └── simulation.lammps
├── cp2k_input
│   └── template.inp
├── cp2k_output
├── lammps
│   └── template.lmp
├── n2p2
│   └── input.nn.template
├── qe
│   ├── pseuodos
│   │   └── ...
│   └── mcresol-T300-p1.xyz
├── scripts
│   ├── cp2k.ipynb
│   ├── data_pruning.ipynb
│   ├── quantum_espresso.ipynb
│   ├── workflow.ipynb
│   └── visualise.ipynb
├── validation
├── xyz
└── reference.history
```

Functions accept explicit file paths, but their default arguments assume the directory structure above and read from and write to the corresponding locations.

File names are also handled differently depending on whether one file or many files are expected. Because only a single trajectory is expected, it is given as a full file name (e.g. `'example_trajectory.history'`). Individual frames are written to multiple files with a regular naming pattern, so their file name should contain a pair of braces to allow formatting (e.g. `'xyz/{}.xyz'`).
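The brace convention can be illustrated with Python's built-in `str.format`. This is a minimal sketch rather than the package's own code: the variable names are hypothetical, and the file names are the examples given above.

```python
# Hypothetical names, used only to illustrate the two conventions.
trajectory_file = "example_trajectory.history"  # single file: full name
frame_pattern = "xyz/{}.xyz"                    # many files: braces filled per frame

# str.format substitutes each frame index into the pair of braces.
frame_files = [frame_pattern.format(i) for i in range(3)]
print(frame_files)  # ['xyz/0.xyz', 'xyz/1.xyz', 'xyz/2.xyz']
```

A pattern without braces would produce the same name for every frame, so each file would overwrite the previous one.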

Finally, some steps use "template" files. These hold details that do not need to be changed routinely as part of the workflow and do not depend on the structures being handled. To change these details, modify the template files directly.

The majority of file management and high-level commands are called via the `Controller` object. It stores information about the directory structure, the location of executables and the properties of the atoms in question. The atomic properties are held in `Species` and `Structure` objects, which store general information about the systems of interest, while specific configurations of atoms are represented by a `Dataset` and its constituent `Frames`.

In [None]:
# Executables, filepaths, etc.
main_directory = '..'
n2p2_sub_directories = ['n2p2']
lammps_sub_directory = 'lammps'
n2p2_bin = '/path/to/n2p2/bin'
lammps_executable = '/path/to/lammps/build/lmp_mpi'
n2p2_module_commands = [
    'export OPENBLAS_NUM_THREADS=1'
]
slurm_constraint = "constraint"

In [None]:
from cc_hdnnp.controller import Controller
from cc_hdnnp.structure import AllStructures, Species, Structure

# Create objects for all elements in the structure
H = Species(
    symbol='H',
    atomic_number=1,
    mass=1.00794,
    valence=1,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
)
C = Species(
    symbol='C',
    atomic_number=6,
    mass=12.011,
    valence=4,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
)
O = Species(
    symbol='O',
    atomic_number=8,
    mass=15.9994,
    valence=6,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
)

# Define a name for the Structure which has the above constituent elements
# Information used for active learning, such as the energy and force tolerances, is also defined here
all_species = [H, C, O]
structure = Structure(name='mcresol', all_species=all_species, delta_E=1e-4, delta_F=1e-2)
all_structures = AllStructures(structure)

controller = Controller(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_sub_directories=n2p2_sub_directories,
    lammps_sub_directory=lammps_sub_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable,
    n2p2_module_commands=n2p2_module_commands,
)

## 1. Compare distances

It can be useful to calculate the distances between structures in two data files. For `structure_indicies = [[a, b], [c, d]]`, structures `a` to `b` (inclusive) from the first input file are compared with structures `c` to `d` (inclusive) from the second input file. The results are saved in text files of the form `file_out.i`, where `i` ranges from `c` to `d`; these can then be combined into a single `file_out`.
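To make the inclusive ranges concrete, the sketch below counts the structures and pairwise comparisons implied by a given `structure_indicies`, and lists the partial output names. It is illustrative only: the naming `"<file_out>.<i>"` is an assumption based on the description above, not a confirmed detail of `cc_hdnnp`.

```python
# Illustrative only: how inclusive index ranges translate into counts and
# per-structure output names. The naming "<file_out>.<i>" is an assumption.
structure_indicies = [[0, 99], [0, 9]]
file_out = "../distance_calcs/distances.csv"

(a, b), (c, d) = structure_indicies
n_first = b - a + 1   # structures taken from the first file
n_second = d - c + 1  # structures taken from the second file
n_pairs = n_first * n_second

# One partial file per structure of the second input file.
partial_files = [f"{file_out}.{i}" for i in range(c, d + 1)]
print(n_first, n_second, n_pairs)  # 100 10 1000
print(partial_files[0], partial_files[-1])
```

Because both ends of each range are included, `[0, 99]` selects 100 structures, not 99.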

In [None]:
from cc_hdnnp.dataset import Dataset

files_in = ["../traj_1.xyz", "../traj_2.xyz"]
formats = ["extxyz", "extxyz"]
file_out = "../distance_calcs/distances.csv"

In [None]:
controller.write_distance_script(
    files_in=files_in,
    formats=formats,
    structure_indicies=[[0, 99], [0, 9]],
    file_out=file_out,
    permute=True,
    ntasks_per_node=16,
    constraint=slurm_constraint,
    export="PATH,OPENBLAS_NUM_THREADS",
)

The script to compare distances can then be run as follows:

In [None]:
!sbatch calc_distances.sh

The output files can then be combined into a single file:

In [None]:
controller.combine_distance_files(
    file_in=files_in[1],
    format=formats[1],
    indicies=[0, 9],
    file_out=file_out,
)

The results can then be imported and analysed:

In [None]:
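import matplotlib.pyplot as plt
import numpy as np

# Load the combined distances written by combine_distance_files
distances = np.genfromtxt(file_out, delimiter=',')
# Index ranges are inclusive, so [0, 99] and [0, 9] give 100 and 10 structures
distances = distances.reshape(100, 10)

# Each column (one structure from the second file) is drawn as its own series
plt.hist(distances, bins=20)
plt.show()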
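Beyond a histogram, one possible analysis is to find, for each structure in the first file, its closest structure in the second file. The sketch below uses synthetic random values in place of real distances and assumes an array layout of 100 rows (first file) by 10 columns (second file); the threshold is a hypothetical value chosen only for illustration.

```python
import numpy as np

# Synthetic stand-in for the combined distances: 100 structures from the
# first file (rows) against 10 from the second (columns). Real values would
# come from the combined output file instead.
rng = np.random.default_rng(0)
distances = rng.random((100, 10))

# For each structure in the first file, the closest structure in the second.
nearest_index = distances.argmin(axis=1)
nearest_distance = distances.min(axis=1)

# Structures whose nearest neighbour is further away than the (hypothetical)
# threshold are the least well represented by the second file.
threshold = 0.2
poorly_covered = np.flatnonzero(nearest_distance > threshold)
print(nearest_index.shape, nearest_distance.shape, len(poorly_covered))
```

Structures flagged in this way could be candidates for closer inspection, since nothing in the second file resembles them closely.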