# Analysing Datasets and Model Predictions

The previous example showed how to obtain a simple prediction-label correlation plot.
For most real-life applications, however, that plot alone is not enough to judge how reliable the model is across configuration space.
This notebook goes into more detail on how to use the analysis tools implemented in IPS.
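
To illustrate why a single global metric can be misleading, the sketch below computes the error separately for different ranges of the label. It is not part of IPS: the helper names `rmse` and `binned_rmse` are hypothetical, and the numbers are toy values invented for illustration only.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root-mean-square error over (label, prediction) pairs."""
    return math.sqrt(sum((p - y) ** 2 for y, p in pairs) / len(pairs))


def binned_rmse(labels, predictions, bin_width):
    """Group samples by label value and return {bin_start: RMSE} per bin."""
    bins = defaultdict(list)
    for y, p in zip(labels, predictions):
        bin_start = math.floor(y / bin_width) * bin_width
        bins[bin_start].append((y, p))
    return {start: rmse(pairs) for start, pairs in sorted(bins.items())}


# Toy values, invented purely for illustration (not IPS output).
labels = [0.1, 0.4, 0.7, 1.2, 1.5, 1.8, 2.1, 2.6]
predictions = [0.1, 0.4, 0.8, 1.2, 1.5, 1.8, 2.6, 3.2]

overall = rmse(list(zip(labels, predictions)))
per_bin = binned_rmse(labels, predictions, bin_width=1.0)

print(f"overall RMSE: {overall:.3f}")
for start, value in per_bin.items():
    print(f"labels in [{start:.1f}, {start + 1.0:.1f}): RMSE {value:.3f}")
```

In this toy case the overall error looks moderate, while the bin with the largest labels is noticeably worse than the rest. A correlation plot summarised by one number would hide exactly this kind of regional failure.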

## Data Generation and Training

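We will once again create a simple dataset, this time by running a short Langevin molecular dynamics simulation of fcc copper with the EMT calculator from ASE.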

In [1]:
import ipsuite as ips
from zntrack.utils import cwd_temp_dir

temp_dir = cwd_temp_dir()

import os
from ase import units
from ase.calculators.emt import EMT
from ase.io.trajectory import TrajectoryWriter
from ase.lattice.cubic import FaceCenteredCubic
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from ase.md.langevin import Langevin
from ase.visualize import view


2023-06-06 22:00:51,620 (DEBUG): Welcome to IPS - the Interatomic Potential Suite!


In [2]:
!git init
!dvc init

Initialized empty Git repository in /tmp/tmpjx247w5h/.git/
Initialized DVC repository.

You can now commit the changes to git.


In [3]:
size = 3

# Set up a periodic fcc copper supercell (size x size x size conventional cells)
atoms = FaceCenteredCubic(
    directions=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    symbol='Cu',
    size=(size, size, size),
    pbc=True
)
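
As a rough picture of what the lattice builder produces, the following sketch constructs the same kind of supercell by hand, without ASE. The function `fcc_positions` is a hypothetical helper, and the lattice constant of 3.61 Å is an assumed, commonly tabulated value for Cu; the atom ordering in ASE may differ.

```python
from itertools import product

# Fractional coordinates of the four atoms in a conventional fcc cell.
FCC_BASIS = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.5, 0.0, 0.5), (0.0, 0.5, 0.5)]


def fcc_positions(size, lattice_constant):
    """Cartesian positions for a size x size x size fcc supercell."""
    positions = []
    for i, j, k in product(range(size), repeat=3):
        for bx, by, bz in FCC_BASIS:
            positions.append(
                (
                    (i + bx) * lattice_constant,
                    (j + by) * lattice_constant,
                    (k + bz) * lattice_constant,
                )
            )
    return positions


# 3.61 Angstrom is an assumed lattice constant for Cu (illustrative only).
positions = fcc_positions(size=3, lattice_constant=3.61)
n_atoms = len(positions)  # 4 atoms per cell * 3**3 cells
print(n_atoms)
```

With `size = 3` this gives 4 × 27 = 108 atoms, which is the system size used in the rest of the notebook.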

In [4]:
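timestep = 5 * units.fs  # MD time step
steps = 100  # number of MD steps to run
temperature = 800  # target temperature in K
traj_path = os.path.join(temp_dir.name, "trajectory.traj")

# Use the EMT calculator and draw initial momenta at the target temperature
atoms.calc = EMT()
MaxwellBoltzmannDistribution(atoms, temperature_K=temperature)

# Langevin dynamics keeps the system near the target temperature
dyn = Langevin(atoms, timestep, temperature_K=temperature, friction=0.002)

# Write every MD step to the trajectory file
writer = TrajectoryWriter(traj_path, "w", atoms=atoms)
dyn.attach(writer, interval=1)

dyn.run(steps)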

True

## Dataset Analysis

This time around, we will explore the dataset a bit before training models on it.
It is often useful to visualize the distribution of labels 