# Data Loading and Selection

Welcome to the first IPS example notebook!
Here we demonstrate how to load existing datasets and how to set up both simple and more involved data splitting workflows.

All examples are self-contained, and the data is created within the notebooks themselves.

In [1]:
import ipsuite as ips
from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

import os
from ase import units
from ase.calculators.emt import EMT
from ase.io.trajectory import TrajectoryWriter
from ase.lattice.cubic import FaceCenteredCubic
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.langevin import Langevin
from ase.visualize import view


2023-05-31 23:10:07,791 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!


In [2]:
!git init
!dvc init

Initialized empty Git repository in /tmp/tmpok_da3k5/.git/
Initialized DVC repository.

You can now commit the changes to git.


## Data Creation

First, we will create some sample data using ASE to perform a short molecular dynamics simulation.

TODO make data and run MD
TODO combining multiple datasets

In [3]:
size = 3

# Set up a crystal
atoms = FaceCenteredCubic(
    directions=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    symbol='Cu',
    size=(size, size, size),
    pbc=True
)


In [4]:
timestep = 5 * units.fs
steps = 100
temperature = 800
traj_path = os.path.join(temp_dir.name, "trajectory.traj")


atoms.calc = EMT()
MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)

dyn = Langevin(atoms, timestep, temperature_K=temperature, friction=0.002)

writer = TrajectoryWriter(traj_path, "w", atoms=atoms)
dyn.attach(writer, interval=1)

dyn.run(steps)

True

## Data Loading

IPS uses ASE for many of its internals and datasets can be loaded from any ASE compatible format.
Here we pretend that the sample data created above is a literature dataset that we have already downloaded.

In [5]:
with ips.Project() as project:
    trajectory = ips.AddData(file=traj_path, name="trajectory")
project.run()



Running DVC command: 'stage add --name trajectory --force ...'
Running DVC command: 'repro'


2023-05-31 23:10:19,001 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!


Reading File: 101it [00:00, 3355.28it/s]


The data is read from disk and is now available both to other Nodes in the project and for use in the notebook.

In [6]:
trajectory.load() # requires the project to have been run

This gives us a list of ASE `Atoms` objects that we can work with in the notebook using other ASE functionality, e.g.:

In [None]:
view(trajectory)

The H5MD standard offers substantial size and I/O speed advantages.
For this reason H5MD is used by Nodes which serialize atomistic data, including `AddData`.

## Data Selection

A common way to split data into training, validation and test sets is to partition it randomly in proportions such as 75:15:10.
We can add the respective selection Nodes to our existing project.

In [7]:
with project:
    random_test_selection = ips.configuration_selection.RandomSelection(data=trajectory, n_configurations=10, name="random_test_selection")
    random_val_selection = ips.configuration_selection.RandomSelection(data=random_test_selection.excluded_atoms, n_configurations=15, name="random_val_selection")
    random_train_selection = ips.configuration_selection.RandomSelection(data=random_val_selection.excluded_atoms, n_configurations=75, name="random_train_selection")
project.run()

Running DVC command: 'stage add --name trajectory --force ...'
Running DVC command: 'stage add --name random_test_selection --force ...'
Running DVC command: 'stage add --name random_val_selection --force ...'
Running DVC command: 'stage add --name random_train_selection --force ...'
Running DVC command: 'repro'


2023-05-31 23:10:20,899 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:10:21,047 (DEBUG): Selecting from 101 configurations.
2023-05-31 23:10:22,301 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:10:22,491 (DEBUG): Selecting from 91 configurations.
2023-05-31 23:10:23,804 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:10:24,044 (DEBUG): Selecting from 76 configurations.


Selecting the testing data first means we can change our training and validation selection, e.g. by using a different selection method or a different number of configurations, without altering our test set.
Note that we had to give names to our selection nodes since we use multiple instances of that Node in our graph.
For convenience, it is also possible to assign numerical IDs by supplying `automatic_node_names=True` to the `Project`.
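
To make the chaining of selections concrete, here is a minimal plain-Python sketch of the same idea. It is not how IPS implements `RandomSelection`; the helper `select_random` and the seeds are made up for illustration. Each step draws from whatever the previous step excluded, so re-drawing the validation set cannot touch the test set.

```python
import random


def select_random(indices, n, seed):
    """Pick n indices at random; return (selected, excluded) in original order."""
    rng = random.Random(seed)
    chosen = set(rng.sample(indices, n))
    selected = [i for i in indices if i in chosen]
    excluded = [i for i in indices if i not in chosen]
    return selected, excluded


frames = list(range(101))  # 101 frames, as in the trajectory above

# Test set first, then validation from the remainder, then training.
test, rest = select_random(frames, 10, seed=0)
val, rest_after_val = select_random(rest, 15, seed=1)
train, unused = select_random(rest_after_val, 75, seed=2)

# Re-drawing the validation set with another seed only sees `rest`,
# so it can never overlap with the test set.
val_alt, _ = select_random(rest, 15, seed=42)

print(len(test), len(val), len(train), len(unused))  # 10 15 75 1
```

One frame is left over because the trajectory holds 101 configurations while the three selections add up to 100.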

We can visualize our workflow at any time using `dvc dag`:

In [8]:
!dvc dag

      +------------+       
      | trajectory |       
      +------------+       
             *             
             *             
             *             
+-----------------------+  
| random_test_selection |  
+-----------------------+  
             *             
             *             
             *             
 +----------------------+  
 | random_val_selection |  
 +----------------------+  
             *             
             *             
             *             
+------------------------+ 
| random_train_selection | 
+------------------------+ 


While straightforward, random splitting is not necessarily advisable for molecular dynamics data.
Trajectories are created sequentially, so a random split yields validation samples that lie between training samples in time and are therefore strongly correlated with them.
By separating the dataset into fixed fractions first, we can ensure that the splits do not overlap temporally.
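
The difference can be illustrated with frame indices alone. The sketch below is plain Python rather than IPS code, and the helper `neighbours_in_train` is a made-up name. It counts how many validation frames have a training frame directly before or after them in the trajectory.

```python
import random

frames = list(range(101))

# Random split: validation frames are scattered among the training frames.
rng = random.Random(0)
shuffled = frames[:]
rng.shuffle(shuffled)
random_val = sorted(shuffled[:15])
random_train = sorted(shuffled[15:])

# Contiguous split: the validation frames form one block of the trajectory.
block_val = frames[:15]
block_train = frames[15:]


def neighbours_in_train(val, train):
    """Count validation frames that have a directly adjacent training frame."""
    train_set = set(train)
    return sum(1 for v in val if v - 1 in train_set or v + 1 in train_set)


print(neighbours_in_train(random_val, random_train))
print(neighbours_in_train(block_val, block_train))  # 1
```

In the contiguous split only the frame at the block boundary touches the training data. In a random split most validation frames typically have a training frame right next to them.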

For demonstration purposes, we will delete the random splitting workflow we created above with `remove_existing_graph=True`.

In [9]:
with ips.Project(remove_existing_graph=True) as project:
    trajectory = ips.AddData(file=traj_path, name="trajectory")
    test_split = ips.configuration_selection.SplitSelection(data=trajectory, split=0.1, name="test_split")
    val_split = ips.configuration_selection.SplitSelection(data=test_split.excluded_atoms, split=0.17, name="val_split") # 0.15 / 0.9 * 1.0 \approx 0.17
    train_split = val_split.excluded_atoms # remaining ~0.75 of the total data

    test_data = ips.configuration_selection.UniformTemporalSelection(data=test_split, n_configurations=10, name="test_data")
    val_data = ips.configuration_selection.UniformTemporalSelection(data=val_split, n_configurations=15, name="val_data")
    train_data = ips.configuration_selection.UniformEnergeticSelection(data=train_split, n_configurations=75, name="train_data")

project.run()



Running DVC command: 'stage add --name trajectory --force ...'
Running DVC command: 'stage add --name test_split --force ...'
Running DVC command: 'stage add --name val_split --force ...'
Running DVC command: 'stage add --name test_data --force ...'
Running DVC command: 'stage add --name val_data --force ...'
Running DVC command: 'stage add --name train_data --force ...'
Running DVC command: 'repro'


2023-05-31 23:11:15,722 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:11:15,915 (DEBUG): Selecting from 101 configurations.
2023-05-31 23:11:17,228 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:11:17,459 (DEBUG): Selecting from 91 configurations.
2023-05-31 23:11:18,780 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:11:19,031 (DEBUG): Selecting from 76 configurations.
2023-05-31 23:11:20,350 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:11:20,561 (DEBUG): Selecting from 10 configurations.
2023-05-31 23:11:21,836 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!
2023-05-31 23:11:22,081 (DEBUG): Selecting from 15 configurations.
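
The value `split=0.17` follows from the fact that each split acts on what the previous one left over, not on the full dataset. As a small sketch of that arithmetic (the function name `sequential_fractions` is made up for this example), fractions of the whole dataset can be converted to fractions of the remainder like this:

```python
def sequential_fractions(global_fractions):
    """Convert fractions of the full dataset into fractions of the remainder."""
    remaining = 1.0
    relative = []
    for fraction in global_fractions:
        relative.append(fraction / remaining)
        remaining -= fraction
    return relative, remaining


# 10 % test and 15 % validation, both relative to the full dataset
relative, train_fraction = sequential_fractions([0.10, 0.15])

print([round(r, 2) for r in relative])  # [0.1, 0.17]
print(round(train_fraction, 2))         # 0.75
```

This agrees with the log output above, which reports 101, 91 and 76 configurations at the successive stages.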


The selection methods here are purely for demonstration purposes.
Usually it makes sense to use all available test data without a sub-selection: the test set should only be evaluated once, so its size does not pose a performance bottleneck.
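
For intuition about the two uniform selection methods used above, the following plain-Python sketch shows one plausible reading of "uniform in time" and "uniform in energy". It is an assumption for illustration, not the IPS implementation. All function names and the energies are made up.

```python
def evenly_spaced_indices(n_total, n_select):
    """Return n_select indices spread evenly over range(n_total)."""
    if n_select == 1:
        return [0]
    step = (n_total - 1) / (n_select - 1)
    return [round(i * step) for i in range(n_select)]


def uniform_temporal(frames, n_select):
    """Pick frames evenly spaced in trajectory order."""
    return [frames[i] for i in evenly_spaced_indices(len(frames), n_select)]


def uniform_energetic(frames, energies, n_select):
    """Pick frames evenly spaced in energy rank, returned in trajectory order."""
    by_energy = sorted(range(len(frames)), key=lambda i: energies[i])
    picked = [by_energy[i] for i in evenly_spaced_indices(len(frames), n_select)]
    return [frames[i] for i in sorted(picked)]


frames = list(range(11))
energies = [(i * 7) % 11 for i in frames]  # made-up energies, all distinct

print(uniform_temporal(frames, 3))             # [0, 5, 10]
print(uniform_energetic(frames, energies, 3))  # [0, 3, 7]
```

In this toy example the energetic selection returns the frames with the lowest, median and highest energy, regardless of where they occur in the trajectory.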

In [10]:
!dvc dag

            +------------+                          
            | trajectory |                          
            +------------+                          
                   *                                
                   *                                
                   *                                
            +------------+                          
            | test_split |                          
            +------------+                          
            ***         ***                         
           *               *                        
         **                 **                      
+-----------+            +-----------+              
| test_data |            | val_split |              
+-----------+            +-----------+              
                        ***          ***            
                       *                *           
                     **                  **         
             +----------+            +--------

Below is a list of all currently implemented selection methods.
Check out the API docs for more information about the methods not covered here.

In [2]:
ips.configuration_selection.__all__

['ConfigurationSelection',
 'RandomSelection',
 'UniformEnergeticSelection',
 'UniformTemporalSelection',
 'UniformArangeSelection',
 'KernelSelection',
 'IndexSelection',
 'ThresholdSelection',
 'SplitSelection']

## Kernel Based Selection Methods

In [None]:
# TODO

In [None]:
temp_dir.cleanup()