# Processing Data

````{tip}
All of the {{ presets }} mentioned below can be set during instantiation:
```python
magdata = MagentroData(..., presets={...})
```
````

In [None]:
from magentropy import MagentroData

magdata = MagentroData('magdata.dat')

## Grouping

Before smoothing, the magnetization data must be grouped by field.
Normally, the measured fields are not exact, so groups must be inferred.
The method {{ test_grouping }} can be used to test grouping presets prior to
fully processing the data.

In [None]:
grouping_presets, grouped_by = magdata.test_grouping()

With no arguments passed, the defaults in {{ presets }} are used. The method returns a dictionary
of the grouping presets used to perform the grouping and a {mod}`pandas` `DataFrameGroupBy` object
for inspecting the results. The
{meth}`DataFrameGroupBy.count() <pandas.core.groupby.DataFrameGroupBy.count>`
method is particularly useful, since it shows how many measurements fall into each group.

In [None]:
grouping_presets

In [None]:
grouped_by['T'].count()

Above, we see that the default grouping presets are an empty array of `fields`, a `decimal` of 5
(the decimal place used for rounding), and an infinite `max_diff`. Detailed information on each of
these can be found in the {{ process_data }} documentation.

In this instance, the presets direct the grouping method to group the fields simply by rounding to
the 5th decimal place, which accurately determines the field groups, as shown by the `count()`
method. There are five fields, each with 100 temperature measurements. In most cases, grouping
by rounding should be sufficient.
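
The idea of grouping by rounding can be sketched with plain {mod}`pandas`. This is only an illustration with made-up data, not the package's implementation; the field column name `'H'` and the jitter size are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three nominal fields, each measured with small jitter.
rng = np.random.default_rng(0)
nominal = np.repeat([1.0, 2.0, 5.0], 4)
measured_H = nominal + rng.uniform(-2e-6, 2e-6, size=nominal.size)

# 'T' and 'H' are illustrative column names for temperature and field.
df = pd.DataFrame({'T': np.tile([2.0, 4.0, 6.0, 8.0], 3), 'H': measured_H})

# Round the field to 5 decimal places and group on the rounded value.
grouped = df.groupby(df['H'].round(5))

# Number of temperature measurements in each inferred field group.
counts = grouped['T'].count()
print(counts)
```

Because the jitter is smaller than half a unit in the 5th decimal place, every measurement rounds to its nominal field, and each group ends up with the same number of temperatures.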

## Smoothing

```{admonition} Reference
J. J. Stickel, [Comput. Chem. Eng. 34, 467 (2010)](https://doi.org/10.1016/j.compchemeng.2009.10.007)
```

There are a number of options to control the smoothing. The default presets have been chosen
sensibly but can be easily changed. All parameters, including grouping parameters, can either be
set as new defaults using {{ presets }} or {{ set_presets }}, or be used for a single run by
passing them as arguments to {{ process_data }}. See the documentation for a complete
description of all parameters. They are summarized below. The use of {{ set_presets }} is
demonstrated in each case with the default presets purely for example; it is not necessary to set
presets if the defaults are to be used.

### Output temperatures

The smoothed magnetic moment will be evaluated at `npoints` evenly-spaced temperatures in the range
`temp_range`. `npoints` expects an integer, and `temp_range` expects an {term}`array_like` of
length 2. The default range `[-numpy.inf, numpy.inf]` adjusts to the maximum range in the data automatically.
Additionally, only those fields with at least `min_sweep_len` measured temperatures in their
respective temperature sweeps will be processed. The default is 10.
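
A minimal sketch of how such an output grid could be built is given here. The helper `output_temperatures` is hypothetical, and the rule of intersecting `temp_range` with the measured range is an assumption about the behavior described above, not the package's actual code.

```python
import numpy as np

def output_temperatures(measured_T, npoints=1000, temp_range=(-np.inf, np.inf)):
    """Evenly spaced grid; infinite bounds fall back to the measured range."""
    measured_T = np.asarray(measured_T, dtype=float)
    # Assumed rule: never extend the grid beyond the measured temperatures.
    low = max(temp_range[0], measured_T.min())
    high = min(temp_range[1], measured_T.max())
    return np.linspace(low, high, npoints)

sweep = np.array([2.1, 5.0, 9.7, 20.3, 49.8])  # hypothetical temperature sweep

full = output_temperatures(sweep, npoints=5)                          # spans 2.1 to 49.8
clipped = output_temperatures(sweep, npoints=4, temp_range=(10, 40))  # spans 10 to 40
```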

In [None]:
from numpy import inf

magdata.set_presets(npoints=1000, temp_range=[-inf, inf], min_sweep_len=10)

### Regularization

The two most important options for the regularization (smoothing) itself are the derivative order
`d_order` and the regularization parameter {math}`\lambda` for each field, `lmbds`.

The derivative of the magnetic moment with respect to temperature of order `d_order` is used to
quantify the "roughness" of the fitted curves. Generally, 2 or 3 work well. The default is 2.

The regularization parameter determines the emphasis that is given to the roughness
regularization penalty. A higher {math}`\lambda` results in a smoother curve, and a {math}`\lambda`
of zero results in interpolation. A {math}`\lambda` can be specified for each field
(in increasing field order) as an {term}`array_like` of the same length as the number of fields.
Any field with a corresponding {math}`\lambda` of {data}`numpy.nan` will have an "optimal"
{math}`\lambda` determined automatically; see [below](#optimal-regularization-parameters).
The default `lmbds` is an array with a single {data}`numpy.nan`, which indicates that an optimal
{math}`\lambda` should be found for each field. The same behavior occurs if an empty list
is given.
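
The roles of `d_order` and {math}`\lambda` can be illustrated with a simplified penalized least-squares smoother. This sketch assumes evenly spaced points and unit weights, so it is not the scheme of the reference above or the package's implementation, and its {math}`\lambda` values are not on the same scale as `lmbds`. The functions `smooth` and `roughness` are hypothetical helpers.

```python
import numpy as np

def smooth(y, lmbd, d_order=2):
    """Minimize ||y - y_hat||^2 + lmbd * ||D y_hat||^2 on an evenly spaced grid."""
    n = len(y)
    # Finite-difference operator of order d_order: D @ y == np.diff(y, n=d_order).
    D = np.diff(np.eye(n), n=d_order, axis=0)
    return np.linalg.solve(np.eye(n) + lmbd * D.T @ D, y)

def roughness(y, d_order=2):
    """Sum of squared finite differences of order d_order."""
    return np.sum(np.diff(y, n=d_order) ** 2)

# Hypothetical noisy moment curve.
rng = np.random.default_rng(1)
T = np.linspace(2, 50, 100)
noisy = np.tanh((30 - T) / 5) + rng.normal(0, 0.05, T.size)

interpolated = smooth(noisy, lmbd=0.0)  # lambda = 0 reproduces the data
gentle = smooth(noisy, lmbd=1.0)
strong = smooth(noisy, lmbd=100.0)      # larger lambda gives a smoother curve
```

Comparing `roughness(noisy)`, `roughness(gentle)`, and `roughness(strong)` shows the penalty shrinking as {math}`\lambda` grows.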

In [None]:
magdata.set_presets(d_order=2, lmbds=[])

### Optimal regularization parameters

Numerical optimization is used to determine the optimal regularization parameter for each field
without a {math}`\lambda` provided. Three metrics are available to quantify the meaning of
"optimal":

1. Generalized cross validation (GCV). The GCV variance is minimized. Set `match_err` to `False`
(default).

2. Error matching. The standard deviation of the absolute differences between the measured and
smoothed magnetic moment points is matched to a value. The squared difference between the standard
deviation and this value is minimized. Set `match_err` to a single value to match this value for
all fields, an {term}`array_like` of the same length as the number of fields to match a different
value for each field (in order of increasing field), or one of `'min'`, `'mean'`, or `'max'` to
use the minimum, mean, or maximum value of the error column for each field as the value.

3. Per-point error matching (experimental). The absolute differences between the measured and
smoothed magnetic moment points are computed, and the sum of squared differences between these
and the corresponding values in the error column is minimized. Set `match_err` to `True`.

Each of these requires an initial guess, given by `lmbd_guess`. Currently, a single guess to use
for all fields is supported. For control over the minimization, keyword arguments to pass
to {func}`scipy.optimize.minimize` can be given as a dictionary to `min_kwargs`. Keep in mind
that any values passed to `min_kwargs` (bounds and tolerances, for example) should be expressed in
terms of {math}`\log_{10} \lambda`, since this is the variable that is optimized internally.
(However, `lmbd_guess` is the guess for {math}`\lambda` itself, not its logarithm.) Lastly,
`weight_err` specifies whether to weight measurements by the normalized inverse squares of the
errors. The default is `True`.

See {{ process_data }} for full documentation.
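
To make the GCV option and the {math}`\log_{10} \lambda` convention concrete, a rough sketch follows. It uses a simplified, unweighted smoother on evenly spaced points and the standard GCV variance, so it is only an assumption-laden illustration rather than the package's internals. The function `gcv_variance`, the test data, and the guess of 10 are all hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def gcv_variance(log10_lmbd, y, d_order=2):
    """GCV variance of a simple penalized smoother, as a function of log10(lambda)."""
    lmbd = 10.0 ** np.ravel(log10_lmbd)[0]
    n = len(y)
    D = np.diff(np.eye(n), n=d_order, axis=0)
    # Hat matrix: maps the data to the smoothed values, y_hat = hat @ y.
    hat = np.linalg.inv(np.eye(n) + lmbd * D.T @ D)
    residual = y - hat @ y
    return (residual @ residual / n) / (1.0 - np.trace(hat) / n) ** 2

# Hypothetical noisy moment curve.
rng = np.random.default_rng(2)
T = np.linspace(2, 50, 100)
y = np.tanh((30 - T) / 5) + rng.normal(0, 0.05, T.size)

lmbd_guess = 10.0  # guess for lambda itself
result = minimize(
    gcv_variance,
    x0=[np.log10(lmbd_guess)],  # the optimizer works on log10(lambda)
    args=(y,),
    method='Nelder-Mead',
    # Tolerances refer to log10(lambda) and to the GCV variance, respectively.
    options={'xatol': 1e-2, 'fatol': 1e-6, 'maxfev': 100},
)
optimal_lmbd = 10.0 ** result.x[0]
```

Optimizing over the logarithm keeps {math}`\lambda` positive without needing bounds and lets the search cover several orders of magnitude with steps of comparable size.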

In [None]:
magdata.set_presets(
    lmbd_guess=1e-4, weight_err=True, match_err=False,
    min_kwargs={
        'method': 'Nelder-Mead',
        'bounds': ((-inf, inf),),
        'options': {'maxfev': 50, 'xatol': 1e-2, 'fatol': 1e-6}
    }
)

### Integrating from zero field

The calculation of entropy requires that the derivative of the magnetic moment with respect to
temperature be integrated with respect to magnetic field, starting at zero field. Zero field
measurements (with zero moment) are prepended before integration during processing, so it is not
necessary to include zero field measurements in the input data.
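
The prepending step can be sketched as follows. The numbers are invented, unit prefactors are omitted, and the trapezoidal rule is an assumption made for illustration; the integration scheme actually used by {{ process_data }} may differ.

```python
import numpy as np

# Hypothetical smoothed dM/dT: rows are fields 1, 2, 3; columns are three temperatures.
fields = np.array([1.0, 2.0, 3.0])
dM_dT = np.array([
    [-0.2, -0.4, -0.1],
    [-0.4, -0.8, -0.2],
    [-0.6, -1.2, -0.3],
])

# Prepend the zero-field row, where the moment (and hence dM/dT) is zero.
H = np.concatenate(([0.0], fields))
dM_dT_with_zero = np.vstack((np.zeros(dM_dT.shape[1]), dM_dT))

# Cumulative trapezoidal integration over field, one column per temperature.
steps = np.diff(H)[:, None] * 0.5 * (dM_dT_with_zero[1:] + dM_dT_with_zero[:-1])
delta_S = np.cumsum(steps, axis=0)  # row i: integral from zero field up to fields[i]
```

Without the zero-field row, the first interval (from zero to the lowest measured field) would be missing from every integral.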

The zeros can be included in {{ processed_df }} if `add_zeros` is set to `True`. It is `False` by
default.

In [None]:
magdata.set_presets(add_zeros=False)

## Demonstration

Simple usage of {{ process_data }} is shown, including the adjustment of the regularization
parameters by eye after they have been estimated initially. Plots are used to verify the
success of the smoothing. See {doc}`plotting` for more information.

In [None]:
magdata.process_data()

In [None]:
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 180

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(14, 5))

magdata.plot_lines(data_prop='M_per_mass', data_version='compare', ax=ax[0])

magdata.plot_lines(
    data_prop='dM_dT', data_version='compare', ax=ax[1], colorbar=True,
    colorbar_kwargs={'ax': ax, 'fraction': 0.1, 'pad': 0.02}
);

```{note}
The errors in this fake data are greatly exaggerated! Most instruments will have relative errors
much smaller than those shown here.
```

We can see that the smoothed data (lines) looks much better than the raw data (dots), especially
in the derivative plot on the right. Generalized cross validation has done a pretty good job of
selecting optimal regularization parameters, which we can view using {{ last_presets }}:

In [None]:
magdata.last_presets['lmbds']

The {{ presets }} are the same as they were before; however, setting them to {{ last_presets }} is simple:

In [None]:
magdata.presets = magdata.last_presets
magdata.presets

We could also adjust `lmbds` for a single run and re-process:

In [None]:
magdata.process_data(lmbds=[1e-4, 5e-5, 1e-4, 1e-5, 1e-5])

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
magdata.plot_lines(data_prop='M_per_mass', data_version='compare', ax=ax[0])
magdata.plot_lines(
    data_prop='dM_dT', data_version='compare', ax=ax[1], colorbar=True,
    colorbar_kwargs={'ax': ax, 'fraction': 0.1, 'pad': 0.02}
);

The error column in {{ processed_df }} will still be empty after all this.
See {doc}`bootstrap_estimates`.