In [1]:
import logging
import warnings

import cmomy
import numpy as np

rng = cmomy.random.default_rng(0)

np.set_printoptions(precision=4)
warnings.filterwarnings("ignore")


logger = logging.getLogger()
logger.setLevel(logging.ERROR)

# Data Organization


All of the extrapolation and interpolation models in the thermoextrap package expect input data to be organized in a certain fashion. To help manage the data, there are data objects that help organize it. Even the inputs to these data objects, however, must be organized appropriately. Here we will use data from our ideal gas test system (from {mod}`thermoextrap.idealgas`) to demonstrate this organization, as well as the various types of data that may be provided as input.

In [2]:
# need imports
%matplotlib inline
import numpy as np
import xarray as xr

# import thermoextrap
import thermoextrap as xtrap

# Import idealgas module
from thermoextrap import idealgas

In [3]:
# Define reference beta
beta_ref = 5.6

# And maximum order
order = 2

npart = 1000  # Number of particles (in single configuration)
nconfig = 100_000  # Number of configurations

# Generate all the data we could want
xdata, udata = idealgas.generate_data((nconfig, npart), beta_ref)

Refer to {mod}`thermoextrap.data` for more information on the data classes.

## Basics

Rather than passing data directly to `__init__` methods for creating data class objects and simultaneously telling them which dimensions mean what (or expecting that specific dimensions mean a certain thing), {mod}`thermoextrap` uses {mod}`xarray` to label the dimensions of inputs. This is useful in the background, and it also helps to clarify what is expected of user inputs.

Currently, `xdata` has shape `(nconfig,)`, i.e., one entry per generated configuration, with each entry being the average $x$ location for the associated configuration.

In [4]:
print(nconfig, xdata.shape)

100000 (100000,)


The dimension over which independent samples vary is the "record" dimension, with its default name in {mod}`thermoextrap.data` being 'rec'. So when we create an {class}`xarray.DataArray` object to house the input $x$ data, we must label that dimension 'rec'. The same goes for the input potential energy data. Note that the argument `dims` takes a list of strings naming the dimensions of the array passed to {class}`xarray.DataArray`.

In [5]:
xdata = xr.DataArray(xdata, dims=["rec"])
udata = xr.DataArray(udata, dims=["rec"])

Now when we create a data object in {mod}`thermoextrap` to hold the data, we tell it that the "record" dimension, `rec_dim`, is named 'rec'. This is the default, but the dimension could be named something different as long as you provide that name to `rec_dim`.

Note that `xv` is the argument for the observable $x$, and `uv` is the potential energy or the appropriate Hamiltonian or thermodynamic conjugate variable.

In [6]:
data = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=xdata, uv=udata, central=True
)

TypeError: xCentralMoments.from_vals() got an unexpected keyword argument 'w'

A couple more notes are in order about the inputs to any of the {mod}`thermoextrap.data` object variants. First, you only need to provide the order you expect to extrapolate to up front if you're using the {meth}`~thermoextrap.data.DataCentralMomentsVals.from_vals` constructor. This is because you need to specify the order of moments that will be calculated from the raw data.

The next argument to be aware of is `central`. This is True by default and tells the data object to work with central moments when calculating derivatives in the background, which turns out to be much more numerically stable than working with non-central moments. You probably want `central` to be True, but know that you can change it if you wish.
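To illustrate the stability point in general terms, here is a minimal NumPy-only sketch. It is not part of thermoextrap, and the synthetic energies, the offset, and the variable names are assumptions made for illustration. It estimates the variance of energies sitting on a large offset in two ways: from non-central moments and from deviations about the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical energies: unit-variance noise on top of a large offset
offset = 1.0e9
u = offset + rng.normal(size=100_000)

# Non-central route: <U^2> - <U>^2 subtracts two numbers near 1e18
var_raw = np.mean(u**2) - np.mean(u) ** 2

# Central route: average the squared deviations from the mean
var_central = np.mean((u - u.mean()) ** 2)

print(f"from raw moments:     {var_raw}")
print(f"from central moments: {var_central}")
```

In double precision, the first estimate loses essentially all significant digits to cancellation, while the second stays close to the true variance of 1.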

## Data Structure

```{eval-rst}
.. currentmodule:: thermoextrap.data
```



A lot of data is already computed as soon as we create our data object. The original raw data is still stored in {attr}`~DataCentralMomentsVals.xv` and {attr}`~DataCentralMomentsVals.uv`, and order is {attr}`~DataCentralMomentsVals.order`, but we can already see the central moments appearing if we look at...

In [7]:
data.xv

NameError: name 'data' is not defined

In [8]:
data.xave

NameError: name 'data' is not defined

{attr}`~DataCentralMomentsVals.xave` is the average observable value.

In [9]:
data.u

NameError: name 'data' is not defined

{attr}`~DataCentralMomentsVals.u` are the moments of the potential energy, i.e., $\langle U^i \rangle$ for the $i^\mathrm{th}$ index in the array.

For the central moments of the potential energy, $\langle (U - \langle U \rangle )^i \rangle$, you can look at {attr}`~DataCentralMomentsVals.du`.

In [10]:
data.du

NameError: name 'data' is not defined

The other necessary component for calculating derivatives is $\langle x U^i \rangle$, which is in {attr}`DataCentralMomentsVals.xu`.

In [11]:
data.xu

NameError: name 'data' is not defined

Or, if working with central moments, $\langle (x - \langle x \rangle) (U - \langle U \rangle)^i \rangle$ is in {attr}`DataCentralMomentsVals.dxdu`.

In [12]:
data.dxdu

NameError: name 'data' is not defined

All of this information is condensed in {attr}`~DataCentralMomentsVals.values`, which holds exactly what we need for computing derivatives.

In [13]:
data.values

NameError: name 'data' is not defined

Understanding this internal structure will help to understand possible inputs as well. In {attr}`~DataCentralMoments.values`, the data object has stored $\langle x \rangle$, the moments $\langle U^i \rangle$, and the cross moments $\langle x U^i \rangle$, in the central form described below. Note that the size of the second dimension, "umom", short for "U moments", is just `order` plus 1. That makes sense if we remember that the zeroth order derivative is the observable itself and we specify that we want to use up to `order` derivatives. Along the second dimension, the exponent $i$ on $U$ equals the index along that dimension, i.e., the order of that moment. The first dimension does the same thing, but with $x$. Regardless of the order, however, we only ever need $x$ raised to the zeroth or first power in the average.

The first row in `values` contains all moments of just $U$, i.e., $\langle U^0 \rangle$, $\langle U^1 \rangle$, $\langle U^2 \rangle$, etc. The second row in `values` contains all moments of $x$ multiplied by powers of $U$, i.e., $\langle x U^0 \rangle$, $\langle x U^1 \rangle$, etc. Note, however, that beyond the powers of 0 and 1 for the first row, and just 0 for the second row, all values shown are central moments, e.g., $\langle (x - \langle x \rangle)(U - \langle U \rangle)^i \rangle$ or $\langle (U - \langle U \rangle)^i \rangle$.

In other words, {attr}`~DataCentralMomentsVals.values` is a special array with structure...

for i + j <= 1: <br>
`data.values[0, 0]` = {sum of weights or count} <br>
`data.values[1, 0]` = {ave of x} = $\langle x \rangle$ <br>
`data.values[0, 1]` = {ave of u} = $\langle U \rangle$ <br>

for i + j > 1: <br>
`data.values[i, j]` = $\langle (x - \langle x \rangle)^i (U - \langle U \rangle)^j \rangle$

To summarize, {attr}`~DataCentralMomentsVals.values` contains the bare bones of what is required for calculating derivatives and will be shared in some form or another across all data classes, with this information passed to the functions that compute derivatives.
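As a rough illustration of this layout, the following NumPy-only sketch builds an array with the same structure from raw samples. The helper `central_comoments` and the synthetic data are hypothetical and written only for this example; they are not thermoextrap functions, and equal weights of 1 are assumed.

```python
import numpy as np


def central_comoments(x, u, order):
    """Return an array laid out like the structure described above.

    Shape is (2, order + 1): the first axis is the power of x (0 or 1),
    the second axis is the power of U (0 through order).
    """
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    dx = x - x.mean()
    du = u - u.mean()

    out = np.empty((2, order + 1))
    for i in range(2):
        for j in range(order + 1):
            # Central co-moment <(x - <x>)^i (U - <U>)^j>
            out[i, j] = np.mean(dx**i * du**j)

    # Special entries for i + j <= 1
    out[0, 0] = x.size  # count (all weights equal to 1)
    out[1, 0] = x.mean()  # <x>
    if order >= 1:
        out[0, 1] = u.mean()  # <U>
    return out


rng = np.random.default_rng(0)
u = rng.normal(loc=2.0, scale=0.5, size=10_000)
x = 0.3 * u + rng.normal(scale=0.1, size=10_000)

vals = central_comoments(x, u, order=2)
print(vals.shape)  # (2, 3)
```

Here `vals[0, 2]` is the variance of `u` and `vals[1, 1]` is the covariance of `x` and `u`, matching the `i + j > 1` rule.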

## Input formats and resampling

Since {attr}`~DataCentralMomentsVals.values` reflects the internal structure, you can just provide it directly (or something similar to it in terms of moments) if you prefer to do that. You'll just need to use a different data class, {class}`DataCentralMoments`, and a different constructor method, {meth}`DataCentralMoments.from_data`.

While {class}`DataCentralMomentsVals` is designed to work with 'values' (i.e., individual observations), {class}`DataCentralMoments` is designed to work with moments. Both classes can be constructed from 'values', but {class}`DataCentralMomentsVals` retains the underlying values (for resampling, etc.), while {class}`DataCentralMoments` converts the values to moments and goes from there. Basically, if you have pre-computed moments (e.g., from a simulation), {class}`DataCentralMoments` is probably what you want to use. Note that resampling for {class}`DataCentralMoments` is based on resampling over multiple samples of the moments.

For example, if we construct a {class}`DataCentralMoments` object using the {meth}`DataCentralMoments.from_data` constructor, we have:

In [14]:
data_noboot = xtrap.DataCentralMoments.from_data(data.values)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values

NameError: name 'data' is not defined

This is identical to `data` above. Note that the order here is inferred from the passed moments array. Likewise, we could have just created this from values using {meth}`DataCentralMoments.from_vals`:

In [15]:
data_noboot = xtrap.DataCentralMoments.from_vals(
    xv=xdata, uv=udata, rec_dim="rec", central=True, order=order
)
xr.testing.assert_allclose(data_noboot.values, data.values)
data_noboot.values

TypeError: xCentralMoments.from_vals() got an unexpected keyword argument 'w'

However, since `data_noboot` is based on just a single average, bootstrapping makes little sense.  For example:

In [16]:
# data.values holds central moments, so use from_data as in the earlier cell
data_noboot = xtrap.DataCentralMoments.from_data(data.values)

try:
    data_noboot.resample(nrep=3).values
except ValueError as e:
    print("caught error!")
    print(e)

NameError: name 'data' is not defined

versus...

In [17]:
data.resample(nrep=3).values

NameError: name 'data' is not defined

Note that whenever you call `.resample()` a new dimension is created for the output called 'rep', short for "repetitions." This is similar to the 'rec' dimension, but helps you keep track of whether you're working with original data, or a bootstrapped sample.
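Conceptually, resampling over records can be pictured with the NumPy-only sketch below. This is an illustration under stated assumptions, not the thermoextrap implementation: the synthetic observable and the names `idx` and `x_rep` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

nrec, nrep = 1_000, 3
x = rng.normal(loc=1.0, size=nrec)  # hypothetical observable samples along 'rec'

# For each repetition, draw nrec record indices with replacement
idx = rng.integers(0, nrec, size=(nrep, nrec))

# Averaging over the resampled records leaves one value per repetition,
# so the 'rec' axis is replaced by a 'rep' axis of length nrep
x_rep = x[idx].mean(axis=1)

print(x_rep.shape)  # (3,)
print(x_rep.std(ddof=1))  # spread across repetitions estimates the uncertainty
```

The spread of the averages along the new axis is what a bootstrap uses as an uncertainty estimate.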

So clearly, if you would like to provide raw moments and still have the uncertainty quantification in `thermoextrap` work, you will need to provide moments from blocks of your data or from repeated simulations. The code will all still work, though, if you prefer to calculate your own moments (saving them periodically from simulations rather than frequently saving all of the configurations, energies, or observables).

As an example, we can work with blocks of data, adding an axis called 'block' that we will ask the constructor to average over by specifying the `dim` argument.

In [18]:
# Make 100 averaged observations
xx = xr.DataArray(xdata.values.reshape(100, -1), dims=["rec", "block"])
uu = xr.DataArray(udata.values.reshape(100, -1), dims=["rec", "block"])

In [19]:
# Create directly from values of moments - notice that this is DataCentralMoments, not DataCentralMomentsVals
# Effectively just means that the 'rec' dim will not be collapsed when using data for extrapolation, etc.
# Behaves similarly to the 'rep' dim when resampling
data_fv = xtrap.DataCentralMoments.from_vals(
    xv=xx, uv=uu, dim="block", order=order, central=True
)
# So 'rec' is for each separate block average
data_fv.values

TypeError: xCentralMoments.from_vals() got an unexpected keyword argument 'w'

Again, note that we did not use the {class}`DataCentralMomentsVals` class above, but instead the {class}`DataCentralMoments` class. The former is for processing and storing simulation "timeseries" of observable and potential energy values, while the latter takes pre-computed moments, including multiple replicates of precomputed moments as above. Behind the scenes, this will influence how bootstrapped confidence intervals are computed.

What's functionally different, though, is that the 'rec' dim also appears in `.values`. That means that when this data is used in models for extrapolation or interpolation, that dimension will also be preserved. So prediction to a new $\beta$ value will result in an output matching the size of the 'rec' dimension in the same way that it would match the 'rep' dimension created through resampling.

If we resample over this data set, we see that we just take `nrep` random samples from it, putting those samples into a new dimension called 'rep'.

In [20]:
data_fv.resample(nrep=3).values

NameError: name 'data_fv' is not defined

If we had computed the moments from the blocked data ourselves, we could also create a data object with the {meth}`DataCentralMoments.from_ave_raw` constructor (below). Many other constructors exist, including from central moments if you like. If you use those, please take a look at the documentation to make sure you are specifying or using the correct dimension naming conventions, such as 'rec', 'xmom', 'umom', etc. Remember, if you are extrapolating an observable that has an explicit dependence on the extrapolation variable, you also need to specify the 'deriv' dimension that describes the observed derivatives with respect to the extrapolation variable (see the [Temperature extrapolation case 2](./Temperature_Extrap_Case2.ipynb) notebook).

In [21]:
# Compute moments of U, i.e., averages to integer powers up to maximum order desired
mom_u = xr.DataArray(np.arange(order + 1), dims=["umom"])
uave = (uu**mom_u).mean("block")
xuave = (xx * uu**mom_u).mean("block")
data_fa = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=xuave, central=True, w=xx.sizes["block"]
)

xr.testing.assert_allclose(data_fv.values, data_fa.values)

TypeError: DataCentralMoments.from_ave_raw() got an unexpected keyword argument 'w'

The above `.values` should be identical to those from the `from_vals` constructor.
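The reason the two routes agree is that central moments can be recovered from raw averages. The NumPy-only sketch below checks this for the second-order terms on synthetic, well-scaled blocked data. The data and variable names are assumptions for illustration, and the conversion shown is the textbook identity rather than the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical blocked data: 100 blocks ('rec') of 50 samples ('block')
uu = rng.normal(loc=1.0, scale=0.5, size=(100, 50))
xx = 0.2 * uu + rng.normal(scale=0.1, size=(100, 50))

# Raw moments per block, as one might save them from a simulation
u1 = uu.mean(axis=1)  # <U>
u2 = (uu**2).mean(axis=1)  # <U^2>
x1 = xx.mean(axis=1)  # <x>
xu1 = (xx * uu).mean(axis=1)  # <x U>

# Convert the raw averages to central moments
du2_from_raw = u2 - u1**2  # <(U - <U>)^2>
dxdu_from_raw = xu1 - x1 * u1  # <(x - <x>)(U - <U>)>

# Central moments computed directly from the values in each block
du = uu - u1[:, None]
dx = xx - x1[:, None]
du2_direct = (du**2).mean(axis=1)
dxdu_direct = (dx * du).mean(axis=1)

print(np.allclose(du2_from_raw, du2_direct))  # True
print(np.allclose(dxdu_from_raw, dxdu_direct))  # True
```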

At this point, we have seen how the same data objects that interface with extrapolation or interpolation models can be created from different inputs. Other features also exist, such as specifying weights with the argument `w` to a constructor to change the weights used during averaging.

## Vector observables

Finally, we can also have vector observables, like RDFs, for example. This is easy to accomplish with any of the above constructors or types of data input. All that is required is to add another dimension to our {class}`xarray.DataArray` input. Typically, we will call this dimension 'vals', short for "values", which is the default name for this dimension when using the {meth}`DataCentralMomentsVals.from_vals` constructor.

In [22]:
# Extrapolate both average x and average x**2
x_xsq_data = xr.DataArray(
    np.vstack([xdata.values, xdata.values**2]).T,
    dims=["rec", "vals"],
    coords={"vals": ["x", "xsq"]},
)
data_vec = xtrap.DataCentralMomentsVals.from_vals(
    order=order, rec_dim="rec", xv=x_xsq_data, uv=udata, central=True
)

TypeError: xCentralMoments.from_vals() got an unexpected keyword argument 'w'

In [23]:
data_vec.values

NameError: name 'data_vec' is not defined

In [24]:
data_vec.resample(nrep=3).values

NameError: name 'data_vec' is not defined

Note that we have simply added a dimension along which all the same operations happen, but independently for different data. The behavior is identical if we work instead with data from other constructors.
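To make the "same operations, independently per observable" idea concrete, here is a NumPy-only sketch that computes $\langle x U^k \rangle$ for two stacked observables by broadcasting over a trailing axis. The synthetic data and the names `x_vec`, `u_pow`, and `xu_vec` are hypothetical, and this is not how thermoextrap organizes the calculation internally.

```python
import numpy as np

rng = np.random.default_rng(0)

order = 2
nrec = 5_000
u = rng.normal(size=nrec)
x = 0.5 * u + rng.normal(scale=0.1, size=nrec)

# Stack two observables along a trailing 'vals' axis: shape (nrec, 2)
x_vec = np.stack([x, x**2], axis=-1)

# Powers of U along a 'umom' axis: shape (nrec, order + 1)
u_pow = u[:, None] ** np.arange(order + 1)

# <x U^k> for every observable at once: shape (order + 1, 2)
xu_vec = (x_vec[:, None, :] * u_pow[:, :, None]).mean(axis=0)

# The same quantity for the scalar observable x alone: shape (order + 1,)
xu_scalar = (x[:, None] * u_pow).mean(axis=0)

print(xu_vec.shape)  # (3, 2)
print(np.allclose(xu_vec[:, 0], xu_scalar))  # True
```

Each column of `xu_vec` matches what the scalar calculation would give for that observable on its own.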

In [25]:
xx_xsqxsq = xr.DataArray(
    x_xsq_data.values.reshape(100, -1, 2),
    dims=["rec", "block", "vals"],
    coords={"vals": ["x", "xsq"]},
)
x_xsq_uave = (xx_xsqxsq * uu**mom_u).mean("block")
data_fa_vec = xtrap.DataCentralMoments.from_ave_raw(
    u=uave, xu=x_xsq_uave, central=True, w=xx_xsqxsq.sizes["block"]
)

TypeError: DataCentralMoments.from_ave_raw() got an unexpected keyword argument 'w'

In [26]:
data_fa_vec.reduce("rec").values

NameError: name 'data_fa_vec' is not defined

In [27]:
data_fa_vec.resample(nrep=3).values

NameError: name 'data_fa_vec' is not defined