# Data Pruning

As active learning progresses, the dataset of reference atomic configurations can grow prohibitively large, especially if many symmetry functions are used. Several methods are available to reduce the amount of data while still retaining a representative description of the structure in question.

## Assumed directory structure

```
example_directory
├── n2p2
│   ├── atomic-env.G.data
│   ├── input.data
│   ├── input.nn
│   └── scaling.data
└── scripts
    ├── cp2k.ipynb
    ├── data_pruning.ipynb
    ├── quantum_espresso.ipynb
    ├── workflow.ipynb
    └── template.sh
```

In order to run data selection, `"input.data"`, `"input.nn"` and `"atomic-env.G"` must be present. The latter can be generated by running `nnp-atomenv`, but care should be taken: for large data files and large numbers of symmetry functions, this can generate very large files that are then difficult to hold in memory.
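
Before starting a selection run, it can help to confirm that the required files exist and to see how large they are. The following is a minimal sketch using only the standard library; the helper names `missing_inputs` and `size_in_gib` are hypothetical and are not part of `cc_hdnnp`.

```python
from pathlib import Path

# File names are those listed in the text above; the helpers are illustrative only.
REQUIRED_FILES = ("input.data", "input.nn", "atomic-env.G")


def missing_inputs(n2p2_directory, required=REQUIRED_FILES):
    """Return the names of required files that are absent from `n2p2_directory`."""
    directory = Path(n2p2_directory)
    return [name for name in required if not (directory / name).is_file()]


def size_in_gib(path):
    """Return the size of `path` in GiB, to judge whether it will fit in memory."""
    return Path(path).stat().st_size / 1024**3
```

For example, `missing_inputs("../n2p2")` returns an empty list when everything is in place, and `size_in_gib("../n2p2/atomic-env.G")` gives a rough idea of the memory needed to load the file.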

While the network need not be trained prior to pruning the data, it is advisable to run pruning (e.g. `nnp-prune range 1.0E-4`) beforehand, to ensure that symmetry functions with negligible values are excluded.

In [None]:
# Executables and filepaths
main_directory = '..'
# n2p2_bin = '/path/to/n2p2/bin'
n2p2_bin = '/home/vol00/scarf860/cc_placement/n2p2/bin'
# lammps_executable = '/path/to/lammps/build/lmp_mpi'
lammps_executable = '/home/vol00/scarf860/cc_placement/lammps/build/lmp_mpi'
# qe_module_commands = [
#     'module use ...',
#     'module load ...',
# ]

In [None]:
%matplotlib inline
from cc_hdnnp.controller import Controller
from cc_hdnnp.structure import AllStructures, Species, Structure
import cc_hdnnp.visualisation as vis

# Create objects for all elements in the structure
H = Species(
    symbol='H',
    atomic_number=1,
    mass=1.00794,
    valence=1,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
)
C = Species(
    symbol='C',
    atomic_number=6,
    mass=12.011,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
    valence=4,
)
O = Species(
    symbol='O',
    atomic_number=8,
    mass=15.9994,
    min_separation={"H": 0.8, "C": 0.8, "O": 0.8},
    valence=6
)

# Define a name for the Structure which has the above constituent elements
# Information used for active learning, such as the energy and force tolerances, is also defined here
all_species = [H, C, O]
structure = Structure(name='mcresol', all_species=all_species, delta_E=1e-4, delta_F=1e-2)
all_structures = AllStructures(structure)

controller = Controller(
    structures=all_structures,
    main_directory=main_directory,
    n2p2_bin=n2p2_bin,
    lammps_executable=lammps_executable
)

## 1. CUR decomposition

As described by [Imbalzano et al. (2018)](https://arxiv.org/abs/1804.02150), this method of feature selection returns features that were present in the original dataset. This is in contrast to the features returned by a plain SVD, which are linear combinations of the originals and so could not be used directly in the machine learning workflow. Specifically, each feature returned must be either a symmetry function or a single frame of the `"input.data"` file.
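
To illustrate the idea, the sketch below greedily selects columns of a feature matrix by their weight in the leading right singular vector, then removes the selected column's direction from the matrix before the next pick. It is a simplified illustration that assumes NumPy is available. It is not the `Decomposer` implementation, it omits the weighting discussed later, and the function name `cur_select_columns` is hypothetical.

```python
import numpy as np


def cur_select_columns(X, n_to_select):
    """Greedily pick column indices of X by leverage score, with deflation."""
    X = np.array(X, dtype=float)  # work on a copy so the caller's data is untouched
    selected = []
    for _ in range(n_to_select):
        # Importance of each column: squared entry in the leading right singular vector
        _, _, vt = np.linalg.svd(X, full_matrices=False)
        scores = vt[0] ** 2
        if selected:
            scores[selected] = -1.0  # never pick the same column twice
        best = int(np.argmax(scores))
        selected.append(best)
        # Deflate: project the chosen column's direction out of every column
        column = X[:, best]
        norm_sq = column @ column
        if norm_sq > 0.0:
            X = X - np.outer(column, column @ X) / norm_sq
    return selected


# Column 1 is parallel to column 0, so it carries no new information
X = [[3.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]
print(cur_select_columns(X, 2))
```

Because the result is a list of column indices, every selected feature is one of the original columns rather than a mixture of them. Applying the same function to the transposed matrix selects rows (datapoints) instead.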

In [None]:
from cc_hdnnp.data_selection import Decomposer

decomposer = Decomposer(data_controller=controller)


### Symmetry functions
The first mode of operation reduces the number of symmetry functions present in `"input.nn"` from some intentionally large set. This set can be generated in the usual manner (see [Workflow](workflow.ipynb)) either using Imbalzano's initial criteria or another method. A key feature of this approach is the weighting of symmetry functions, so that multiple functions that are rarely evaluated (i.e. with low cutoff and density of contributing atoms) may be selected in favour of one that is evaluated commonly, even if the latter has a greater contribution. This can be controlled with the `weight` argument.

Furthermore, multiple sets of symmetry functions can be output from a single run by providing a list to `n_to_select_list` and `file_out_list`. This can be useful to determine an appropriate trade-off between model accuracy and training time.

In [None]:
decomposer.run_CUR_symf(
    n_to_select_list=[64, 128, 256],
    weight=True,
    file_out_list=["input.nn.64", "input.nn.128", "input.nn.256"]
)

### Datapoints

Alternatively, the number of datapoints in the `"input.data"` file can be reduced. Here, a datapoint is a whole atomic configuration representing the structure, so it will include several atomic positions. The evaluations stored in `"atomic-env.G"` are performed for each atom and each symmetry function relevant to its element. This means that, before selection, the symmetry function vectors of each element present are averaged across the structure and concatenated, giving each structure a representation of length `N_A + N_B + N_C + ...`, where there are `N_A` symmetry functions for element `A` and so on.
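
The averaging and concatenation described above can be sketched in plain Python. This is an illustration under the assumption that the per-atom values are already available as lists; the function name `frame_descriptor` and its arguments are hypothetical and do not reflect how `cc_hdnnp` reads `"atomic-env.G"`.

```python
def frame_descriptor(elements, symf_values, element_order):
    """Average per-atom symmetry function vectors by element, then concatenate.

    elements: element symbol of each atom in the frame
    symf_values: one list of symmetry function values per atom
    element_order: fixed element ordering used for the concatenation
    """
    descriptor = []
    for element in element_order:
        vectors = [v for e, v in zip(elements, symf_values) if e == element]
        if not vectors:
            raise ValueError(f"no atoms of element {element!r} in this frame")
        n_atoms = len(vectors)
        # Column-wise mean over all atoms of this element
        descriptor.extend(sum(column) / n_atoms for column in zip(*vectors))
    return descriptor


# Two H atoms with 2 symmetry functions each, one O atom with 3
elements = ["H", "H", "O"]
symf_values = [[1.0, 3.0], [3.0, 5.0], [0.5, 1.5, 2.5]]
print(frame_descriptor(elements, symf_values, ["H", "O"]))
# → [2.0, 4.0, 0.5, 1.5, 2.5]
```

The resulting vector has length `N_H + N_O = 5` regardless of how many atoms the frame contains, which is what allows frames of different sizes to be compared.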

In [None]:
decomposer.run_CUR_data(
    n_to_select_list=[64, 128, 256],
    file_out_list=["input.data.64", "input.data.128", "input.data.256"]
)

## 2. Separator
An alternative method for reducing the size of `"input.data"` is to remove atomic configurations that have a small Euclidean distance between them in terms of the vector of symmetry functions for all their atoms.

As this involves comparisons over a large parameter space (`n_frames_to_propose * n_frames_to_compare * n_atoms ** 2 * n_symf`), it can be necessary to compare the points in batches, with the proposed frames at the greatest distance selected in favour of those that are closer to frames already selected. By default, the mean distance is used as the criterion, but a float corresponding to a quantile can also be used (e.g. `0.5` for the median distance).
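
As a rough illustration of the criterion, the sketch below reduces all atom-pair Euclidean distances between two frames to a single number, using either the mean or a quantile. It makes simplifying assumptions: elements are ignored, all atoms share the same symmetry function vector length, and the names `frame_distance` and `quantile` are hypothetical rather than part of `Separator`.

```python
import math
from statistics import mean


def quantile(values, q):
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def frame_distance(frame_a, frame_b, criterion="mean"):
    """Reduce all atom-pair Euclidean distances between two frames to one number."""
    # One distance per pair of atoms, hence the n_atoms ** 2 scaling
    distances = [math.dist(a, b) for a in frame_a for b in frame_b]
    if criterion == "mean":
        return mean(distances)
    return quantile(distances, float(criterion))


frame_a = [[0.0, 0.0], [3.0, 4.0]]  # two atoms, two symmetry functions each
frame_b = [[0.0, 0.0]]              # one atom
print(frame_distance(frame_a, frame_b))       # mean of [0.0, 5.0]
print(frame_distance(frame_a, frame_b, 0.5))  # median of [0.0, 5.0]
```

A quantile below `0.5` makes the criterion more sensitive to the closest atom pairs, while the mean weights all pairs equally.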

Finally, to avoid removing the most extreme points in the dataset (and so increasing the number of extrapolation warnings) `select_extreme_frames` can be set. This starts by selecting datapoints that give rise to the most extreme values in the dataset.
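
One way to picture this step is to keep every frame that holds the minimum or maximum of at least one feature. The sketch below does this for per-frame descriptor vectors. It is an assumption about the general idea only: the hypothetical `extreme_frame_indices` may differ from what `select_extreme_frames` does internally (for instance, the real check may be made per atom).

```python
def extreme_frame_indices(frame_descriptors):
    """Return sorted indices of frames holding the min or max of any feature."""
    selected = set()
    n_features = len(frame_descriptors[0])
    for feature in range(n_features):
        column = [descriptor[feature] for descriptor in frame_descriptors]
        selected.add(column.index(min(column)))  # first frame with the minimum
        selected.add(column.index(max(column)))  # first frame with the maximum
    return sorted(selected)


descriptors = [
    [0.0, 5.0],  # minimum of both features
    [1.0, 9.0],  # maximum of feature 1
    [2.0, 7.0],  # maximum of feature 0
    [1.5, 6.0],  # interior point, not extreme in any feature
]
print(extreme_frame_indices(descriptors))
# → [0, 1, 2]
```

Keeping these frames preserves the range of values covered by the training set, which is what limits extrapolation warnings after pruning.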

In [None]:
from cc_hdnnp.data_selection import Separator

separator = Separator(data_controller=controller)
separator.run_separation_selection(
    n_frames_to_select=8, n_frames_to_propose=16, n_frames_to_compare=16, select_extreme_frames=True,
)

## 3. Clustering

While not strictly a method for reducing the data, clustering similar atomic environments can be useful for visualisation. Clustering can be performed on atomic environments, per element, to visualise the different environments the atoms inhabit. Alternatively, the global frame environments can be compared to see how similar different frames in the dataset are, particularly if data is added in batches during active learning.

In [None]:
from cc_hdnnp.data_selection import Clusterer

clusterer = Clusterer(data_controller=controller)
clusterer.run_atom_clustering()
clusterer.run_frame_clustering()

In [None]:
vis.plot_clustering(elements=["H", "C", "O"])
vis.plot_clustering(elements=["all"])