In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

import xarray as xr

In [2]:
import thermoextrap as xtrap

# Data Organization

All of the extrapolation and interpolation models in the thermoextrap package expect input data to be organized in a certain fashion. To help manage the data, the package provides data objects that organize it. Even the inputs to these data objects, however, must be organized appropriately. Here we will use data from our ideal gas test system to demonstrate this organization, as well as the various types of data that may be provided as input.

In [3]:
#Import idealgas module
from thermoextrap import idealgas

#Define reference beta
beta_ref = 5.6

#And maximum order
order = 2

npart = 1000 #Number of particles (in single configuration)
nconfig = 100_000 #Number of configurations

#Generate all the data we could want
xdata, udata = idealgas.generate_data((nconfig, npart), beta_ref)

In [4]:
help(xtrap.core.data)

Help on module thermoextrap.core.data in thermoextrap.core:

NAME
    thermoextrap.core.data - Routines and classes to process input data to expected average format.

DESCRIPTION
    The general scheme is to use the following:
    
    * uv, xv -> samples (values) for u, x
    * u, xu -> averages of u and x*u
    * u[i] = <u**i>
    * xu[i] = <x * u**i>
    * xu[i, j] = <d^i x/d beta^i * u**j

CLASSES
    thermoextrap.core._attrs_utils.MyAttrsMixin(builtins.object)
        AbstractData
            DataCentralMomentsBase
                DataCentralMoments
                DataCentralMomentsVals
            DataValuesBase
                DataValues
                DataValuesCentral
        DataCallbackABC
            DataCallback
        DatasetSelector
    
    class AbstractData(thermoextrap.core._attrs_utils.MyAttrsMixin)
     |  AbstractData(*, meta) -> None
     |  
     |  Baseclass for adding some sugar to attrs.derived classes.
     |  
     |  Method resolution order:
     |     

## Basics

Rather than passing data directly to `__init__` methods for creating data class objects and simultaneously telling them which dimensions mean what (or expecting that specific dimensions mean a certain thing), `thermoextrap` uses xarray to label the dimensions of inputs. This labeling is useful behind the scenes, and it also clarifies what is expected of user inputs.

Currently, `xdata` has shape `(nconfig,)`, i.e., one entry per generated configuration, with each entry being the average $x$ location for that configuration.

In [5]:
print(nconfig, xdata.shape)

100000 (100000,)


The dimension over which independent samples vary is the "record" dimension, with its default name in `thermoextrap` being 'rec'. So when we create an xarray `DataArray` object to house the input $x$ data, we must label that dimension 'rec'. Same goes for the input potential energy data. Note that the list provided to the argument `dims` is a list of strings naming the dimensions in the array passed to `DataArray`.

In [6]:
xdata = xr.DataArray(xdata, dims=['rec'])
udata = xr.DataArray(udata, dims=['rec'])

Now when we create a data object in `thermoextrap` to hold the data, we tell it that the "record" dimension, `rec_dim`, is named 'rec'. This is the default, but the dimension could be named something different as long as you provide that name to `rec_dim`.

Note that `xv` is the argument for the observable $x$, and `uv` is the argument for the potential energy or the appropriate Hamiltonian or thermodynamic conjugate variable.

In [7]:
data = xtrap.DataCentralMomentsVals.from_vals(order=order, 
                                                  rec_dim='rec',
                                                  xv=xdata, uv=udata, central=True)

A couple more notes are in order about the inputs to any of the `thermoextrap` data object variants. First, you only need to provide the order you expect to extrapolate to up front if you're using the `from_vals` constructor. This is because you need to specify the order of moments that will be calculated from the raw data.

The next argument to be aware of is `central`. This is True by default and tells the data object to work with central moments for calculating derivatives in the background, which it turns out is much more numerically stable than non-central moments. You probably want `central` to be True, but know that you can change it if you wish.
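The stability claim can be illustrated with a small NumPy sketch that is independent of `thermoextrap`. The data below are synthetic, and the large offset is an assumption chosen purely to exaggerate the effect: forming a variance as $\langle u^2 \rangle - \langle u \rangle^2$ subtracts two nearly equal large numbers, while subtracting the mean first avoids that cancellation.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=100_000)  # fluctuations with variance close to 1
offset = 1.0e8                    # large mean, chosen to exaggerate the effect
u = offset + noise

# Reference variance computed from the fluctuations alone
var_ref = np.mean((noise - noise.mean()) ** 2)

# Raw-moment route: <u^2> - <u>^2 subtracts two numbers near 1e16
var_raw = np.mean(u**2) - np.mean(u) ** 2

# Central-moment route: subtract the mean first, then square
var_central = np.mean((u - u.mean()) ** 2)

print(var_ref, var_raw, var_central)
```

In double precision the raw-moment route loses essentially all significant digits of the variance here, while the central-moment route stays close to the reference.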

## Data Structure

A lot of data is already computed as soon as we create our data object. The original raw data is still stored in `.xv` and `.uv`, and order is `.order`, but we can already see the central moments appearing if we look at...

In [8]:
data.xave

`.xave` is the average observable value.

In [9]:
data.u

`.u` are the moments of the potential energy, i.e., $\langle U^i \rangle$ for the $i^\mathrm{th}$ index in the array.

For the central moments of the potential energy, $\langle (U - \langle U \rangle )^i \rangle$, you can look at `.du`

In [10]:
data.du

The other necessary component for calculating derivatives is $\langle x U^i \rangle$, which is in `.xu`

In [11]:
data.xu

Or, if working with central moments, $\langle (x - \langle x \rangle) (U - \langle U \rangle)^i \rangle$ is in `.dxdu`

In [12]:
data.dxdu

All of this information is condensed in `.values`, which takes only exactly what we need for computing derivatives.

In [13]:
data.values

Understanding this internal structure will help to understand possible inputs as well. In `.values`, the data object has stored $\langle x \rangle$, the moments $\langle U^i \rangle$, and the cross moments $\langle x U^i \rangle$. Note that the second dimension, "umom", short for "U moments", has size `order` plus 1. That makes sense if we remember that the zeroth order derivative is the observable itself and we specify that we want to use up to `order` derivatives. So the second dimension involves setting the exponent $i$ on $U$ equal to the index along that dimension, i.e., the order of that moment. The first dimension does the same thing, but with $x$. Regardless of the order, however, we only ever need $x$ raised to the zeroth or first power in the average.
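As a reminder of why only the zeroth and first powers of $x$ are needed, the first two derivatives can be written out explicitly. This assumes a canonical ensemble and an observable $x$ with no explicit dependence on $\beta$; the shorthand $\delta x = x - \langle x \rangle$ and $\delta U = U - \langle U \rangle$ is introduced here for convenience.

$$
\frac{d \langle x \rangle}{d \beta} = -\langle \delta x \, \delta U \rangle,
\qquad
\frac{d^2 \langle x \rangle}{d \beta^2} = \langle \delta x \, (\delta U)^2 \rangle
$$

Each further derivative brings in one more power of $U$ but never a higher power of $x$.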

The first row in `.values` contains all moments of just $U$, i.e., $\langle U^0 \rangle$, $\langle U^1 \rangle$, $\langle U^2 \rangle$, etc. The second row in `.values` contains all moments of $x$ multiplied by $U$, i.e., $\langle x U^0 \rangle$, $\langle x U^1 \rangle$, etc.  But note that beyond the powers of 0 and 1 for the first row, and just 0 for the second row, all values shown are central moments, e.g., $\langle (x - \langle x \rangle)(U - \langle U \rangle)^i \rangle$ or $\langle (U - \langle U \rangle)^i \rangle$.

In other words, `.values` is a special array with structure...

for i + j <= 1: <br>
`data.values[0, 0]` = {sum of weights or count} <br>
`data.values[1, 0]` = {ave of x} = $\langle x \rangle$ <br>
`data.values[0, 1]` = {ave of u} = $\langle U \rangle$ <br>

for i + j > 1: <br>
`data.values[i, j]` = $\langle (x - \langle x \rangle)^i (U - \langle U \rangle)^j \rangle$

To summarize, `.values` contains the bare bones of what is required for calculating derivatives and will be shared in some form or another across all data classes, with this information passed to the functions that compute derivatives.
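The layout described above can be mimicked with plain NumPy. This is only a sketch of the array structure for a scalar observable with unit weights, not the `thermoextrap` implementation; the helper name `central_comoments` and the synthetic data are hypothetical.

```python
import numpy as np

def central_comoments(x, u, order):
    """Hypothetical helper: build a (2, order + 1) array with the layout described above."""
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    dx = x - x.mean()
    du = u - u.mean()
    out = np.empty((2, order + 1))
    for i in range(2):              # power of x: 0 or 1
        for j in range(order + 1):  # power of U: 0 .. order
            out[i, j] = np.mean(dx**i * du**j)
    # Entries with i + j <= 1 are the special cases
    out[0, 0] = x.size    # count (sum of unit weights)
    out[1, 0] = x.mean()  # <x>
    out[0, 1] = u.mean()  # <U>
    return out

rng = np.random.default_rng(1)
u_demo = rng.normal(loc=2.0, scale=0.5, size=5000)
x_demo = 0.3 * u_demo + rng.normal(scale=0.1, size=5000)

vals = central_comoments(x_demo, u_demo, order=2)
print(vals.shape)  # (2, 3)
```

Here `vals[0, 2]` is the variance of `u_demo` and `vals[1, 1]` is the covariance between `x_demo` and `u_demo`, both normalized by the number of samples.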

## Input formats and resampling

Since `.values` reflects the internal structure, you can just provide it directly (or something similar to it in terms of moments) if you prefer to do that. You'll just need to use a different constructor method than `from_vals`.

A big caveat, though. All uncertainty estimation happens through bootstrap resampling along the 'rec' dimension of the originally provided data. So if you just pass in the central moments above, you won't be able to calculate uncertainties...

In [14]:
data_noboot = xtrap.DataCentralMoments.from_raw(data.values)

try:
    data_noboot.resample(nrep=3).values
except ValueError as e:
    print('caught error!')
    print(e)

caught error!
not implemented for scalar


versus...

In [15]:
data.resample(nrep=3).values

Even if you add a 'rec' dimension, bootstrapping with a single observation ('rec' dim of size 1) won't do anything...

In [16]:
data_noboot = xtrap.DataCentralMoments.from_raw(xr.DataArray(data.values.values[None, ...], dims=['rec', 'xmom', 'umom']))

In [17]:
data_noboot.resample(nrep=3).values

You just resample the same thing every time!

Note that whenever you call `.resample()`, a new dimension called 'rep', short for "repetitions", is created for the output. This is similar to the 'rec' dimension, but helps you keep track of whether you're working with original data or a bootstrapped sample.

So if you would like to provide raw moments and still have the uncertainty quantification in `thermoextrap` work, you will need to compute those moments from blocks of your data or from repeated simulations. The code will all still work if you prefer to calculate your own moments (saving them periodically from simulations rather than frequently saving all of the configurations, energies, or observables).
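The idea behind resampling along the record dimension can be sketched with NumPy alone. This is a generic bootstrap of the mean, not the `thermoextrap` resampling code; the helper name `bootstrap_means` is hypothetical.

```python
import numpy as np

def bootstrap_means(samples, nrep, rng):
    """Hypothetical helper: means of `nrep` resamples drawn with replacement."""
    samples = np.asarray(samples)
    nrec = samples.shape[0]
    # One row of record indices per repetition
    idx = rng.integers(0, nrec, size=(nrep, nrec))
    return samples[idx].mean(axis=1)

rng = np.random.default_rng(2)
many = rng.normal(size=1000)
single = many[:1]

reps_many = bootstrap_means(many, nrep=3, rng=rng)      # estimates vary between repetitions
reps_single = bootstrap_means(single, nrep=3, rng=rng)  # every repetition is the same value
print(reps_many, reps_single)
```

With a single record, every resample can only draw that one record, so the spread across repetitions carries no information about uncertainty.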

As an example, we can work with blocks of data, adding an axis called 'block' that we will ask the constructor to average over by specifying the `axis` argument.

In [18]:
# Make 100 averaged observations
xx = xr.DataArray(xdata.values.reshape(100, -1), dims=['rec', 'block'])
uu = xr.DataArray(udata.values.reshape(100, -1), dims=['rec', 'block'])

In [19]:
# Create directly from values of moments - notice that this is DataCentralMoments, not DataCentralMomentsVals
# Effectively just means that the 'rec' dim will not be collapsed when using data for extrapolation, etc.
# Behaves similarly to the 'rep' dim when resampling
data_fv = xtrap.DataCentralMoments.from_vals(xv=xx, uv=uu, axis='block', order=order, central=True)
# So 'rec' is for each separate block average
data_fv.values

Note that we did not use the `DataCentralMomentsVals` class above, but instead the `DataCentralMoments` class. The former is for processing and storing simulation "timeseries" of observable and potential energy values, while the latter takes pre-computed moments, including multiple replicates of precomputed moments as above. Behind the scenes, this will influence how bootstrapped confidence intervals are computed.

What's functionally different, though, is that the 'rec' dim also appears in `.values`. That means that when this data is used in models for extrapolation or interpolation, that dimension will also be preserved. So prediction to a new $\beta$ value will result in an output matching the size of the 'rec' dimension in the same way that it would match the 'rep' dimension created through resampling.

If we resample over this data set, we see that we just take `nrep` random samples from it, putting those samples into a new dimension called 'rep'.

In [20]:
data_fv.resample(nrep=3).values

If we had computed the moments from the blocked data ourselves, we could also create a data object with the `from_ave_raw` constructor (below). Many other constructors exist, including ones that start from central moments. If you use those, please take a look at the documentation to make sure you are using the correct dimension naming conventions, such as 'rec', 'xmom', 'umom', etc. Remember, if you are extrapolating an observable that has an explicit dependence on the extrapolation variable, you also need to specify the 'deriv' dimension that describes the observed derivatives with respect to the extrapolation variable (see the "Temperature_Extrap_Case2" notebook).

In [21]:
# Compute moments of U, i.e., averages to integer powers up to maximum order desired
mom_u = xr.DataArray(np.arange(order + 1), dims=['umom'])
uave = (uu ** mom_u).mean('block')
xuave = (xx * uu ** mom_u).mean('block')
data_fa = xtrap.DataCentralMoments.from_ave_raw(u=uave, xu=xuave, central=True)

In [22]:
data_fa.values

The above `.values` should be identical to those from the `from_vals` constructor.
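The reason raw block averages carry the same information as central moments is the binomial expansion of $(U - \langle U \rangle)^j$. The NumPy sketch below checks this per block on synthetic data. It is not the conversion routine used inside `thermoextrap`, and the helper name `raw_to_central` is hypothetical.

```python
import numpy as np
from math import comb

def raw_to_central(u_raw, xu_raw):
    """Hypothetical helper: convert <u**k> and <x u**k> (k on the last axis) to central form."""
    order = u_raw.shape[-1] - 1
    ubar = u_raw[..., 1]   # <u>
    xbar = xu_raw[..., 0]  # <x>
    du = np.zeros_like(u_raw)
    dxdu = np.zeros_like(xu_raw)
    for j in range(order + 1):
        for k in range(j + 1):
            coef = comb(j, k) * (-ubar) ** (j - k)
            du[..., j] += coef * u_raw[..., k]
            dxdu[..., j] += coef * (xu_raw[..., k] - xbar * u_raw[..., k])
    return du, dxdu

rng = np.random.default_rng(3)
nblock, nper, order_demo = 4, 2500, 2
u_blk = rng.normal(loc=1.0, size=(nblock, nper))
x_blk = 0.5 * u_blk + rng.normal(scale=0.2, size=(nblock, nper))

# Raw averages per block, with the moment index k as the last axis
powers = np.arange(order_demo + 1)
u_raw = (u_blk[..., None] ** powers).mean(axis=1)
xu_raw = (x_blk[..., None] * u_blk[..., None] ** powers).mean(axis=1)

du, dxdu = raw_to_central(u_raw, xu_raw)
print(du.shape, dxdu.shape)  # (4, 3) (4, 3)
```

For example, `du[:, 2]` reproduces the per-block variance of `u_blk`, and `dxdu[:, 1]` reproduces the per-block covariance of `x_blk` with `u_blk`.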

At this point, we have seen how the same data objects that interface with extrapolation or interpolation models can be created from different inputs. Other features also exist, such as specifying weights with the argument `w` to a constructor to change the weights used during averaging.
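As a generic illustration of what weighting an average means, here is a short NumPy sketch with synthetic data and hypothetical weights. It does not show how `thermoextrap` handles `w` internally; it only shows weighted first and second central moments.

```python
import numpy as np

rng = np.random.default_rng(4)
u_w = rng.normal(size=1000)
w = rng.uniform(0.5, 1.5, size=1000)  # hypothetical per-sample weights

# Weighted mean and weighted second central moment
u_mean_w = np.average(u_w, weights=w)
du2_w = np.average((u_w - u_mean_w) ** 2, weights=w)

# With equal weights the usual unweighted mean is recovered
u_mean_equal = np.average(u_w, weights=np.ones_like(u_w))
print(u_mean_w, du2_w, u_mean_equal)
```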

## Vector observables

Finally, we can also have vector observables, such as RDFs. This is easy to accomplish with any of the above constructors or types of data input. All that is required is to add another dimension to our xarray `DataArray` input. Typically, we will call this dimension 'vals', short for "values", which is the default name for this dimension when using the `DataCentralMomentsVals.from_vals` constructor.

In [23]:
# Extrapolate both average x and average x**2
x_xsq_data = xr.DataArray(np.vstack([xdata.values, xdata.values**2]).T, dims=['rec', 'vals'], coords={'vals': ['x','xsq']})
data_vec = xtrap.DataCentralMomentsVals.from_vals(order=order, 
                                                      rec_dim='rec',
                                                      xv=x_xsq_data, uv=udata, central=True)

In [24]:
data_vec.values

In [25]:
data_vec.resample(nrep=3).values

Note that we have simply added a dimension along which all the same operations happen, but independently for different data. The behavior is identical if we work instead with data from other constructors.
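The statement that the extra dimension is handled independently can be checked with a small NumPy sketch using broadcasting. The data are synthetic, and the trailing axis only plays a role analogous to 'vals'; this is not the `thermoextrap` code path.

```python
import numpy as np

rng = np.random.default_rng(5)
u_v = rng.normal(size=2000)
x_v = 0.4 * u_v + rng.normal(scale=0.3, size=2000)

# Stack the two observables along a trailing axis, analogous to 'vals'
xvec = np.stack([x_v, x_v**2], axis=-1)  # shape (2000, 2)

du_v = u_v - u_v.mean()
dxvec = xvec - xvec.mean(axis=0)

# <dx du> for every component at once; the record axis is averaged away
cov_vec = (dxvec * du_v[:, None]).mean(axis=0)  # shape (2,)

# Same numbers computed one component at a time
cov_each = np.array([np.mean((col - col.mean()) * du_v) for col in xvec.T])
print(cov_vec, cov_each)
```

The two results agree, which is the sense in which the added dimension is just carried along.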

In [26]:
xx_xsqxsq = xr.DataArray(x_xsq_data.values.reshape(100, -1, 2), dims=['rec', 'block', 'vals'], coords={'vals': ['x','xsq']})
x_xsq_uave = (xx_xsqxsq * uu ** mom_u).mean('block')
data_fa_vec = xtrap.DataCentralMoments.from_ave_raw(u=uave, xu=x_xsq_uave, central=True)

In [27]:
data_fa_vec.resample(nrep=3).values

