# Gaussian process regression and active learning

Now we'll look at the same problem as in the previous tutorial, but replace the hardcoded constitutive laws with purely data-driven ones.
In multiscale simulations, this data comes from molecular dynamics (MD) simulations.
For illustrative purposes, we sample the training data from the same hard-coded constitutive laws as before and pretend it comes from an MD run.
Thus, we generate a *mock* of an actual MD simulation. Real MD data is noisy, so we add random noise to our mock data as well.
As in the previous example, we start with the *YAML* input file:

In [None]:
journal_gp_input = """
options:
    output: data/journal_gp
    write_freq: 100
    use_tstamp: True
grid:
    dx: 1.e-5
    dy: 1.
    Nx: 100
    Ny: 1
    xE: ['D', 'N', 'N']
    xW: ['D', 'N', 'N']
    yS: ['P', 'P', 'P']
    yN: ['P', 'P', 'P']
    xE_D: 877.7007
    xW_D: 877.7007
geometry:
    type: journal
    CR: 1.e-2
    eps: 0.7
    U: 0.1
    V: 0.
numerics:
    CFL: 0.25
    adaptive: 1
    tol: 1e-9
    dt: 1e-10
    max_it: 2_500
properties:
    shear: 0.0794
    bulk: 0.
    EOS: DH
    P0: 101325
    rho0: 877.7007
    T0: 323.15
    C1: 3.5e10
    C2: 1.23
gp:
    press:
        fix_noise: True
        atol: 1.
        rtol: 0.1
        obs_stddev: 100.
        max_steps: 5
    shear:
        fix_noise: True
        atol: 1.
        rtol: 0.1
        obs_stddev: 1.
        max_steps: 5
db:
    dtool: True
    init_size: 5
    init_method: lhc
    init_width: 1.e-6
"""

The first part is identical, but we recognize two new sections, `gp` and `db`, which control the settings for the Gaussian process regression and the underlying training database, respectively. We start by loading the problem as usual:

In [None]:
from GaPFlow import Problem
myProblem = Problem.from_string(journal_gp_input)

Before we unpack what we see here, a word of caution regarding wording and notation.

There are two types of models that can be replaced by a GP: the pressure/normal stress and the viscous shear stress.
Here, we sometimes use the terms *pressure* and *normal stress* synonymously, although they are strictly speaking different things.
For instance, the normal stress component is given by $\sigma_{zz} = -p(\rho) + \tau_{zz}$, but we assume that viscous 
contributions to the normal stress (here: $\tau_{zz}$) are small. 
This is a reasonable assumption, as we have seen in [Tutorial 2](02_stress_sympy.ipynb).

In this example, where we make use of hard-coded constitutive laws, we would be able to include both effects separately,
but with actual MD data this is not the case. Thus, in this example, what we call pressure is the actual thermodynamic pressure, 
and we explicitly set viscous normal stresses to zero. In contrast, with *actual* MD data, what we call pressure is the normal stress $\sigma_{zz}$,
but we assume that it is the same in the other two directions (thus we handle it numerically as if it were the pressure).
Similarly, we ignore all viscous shear stress components except $\tau_{xz}$ and $\tau_{yz}$, whenever we use GPs, since these are the only
ones we measure in an MD run.
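
To make the idea of a mock MD observation concrete, here is a minimal NumPy sketch of how a noisy pressure sample could be generated. It is not GaPFlow's implementation: we assume that `EOS: DH` refers to the Dowson-Higginson equation of state in its common form $p = P_0 + C_1 (\rho/\rho_0 - 1)/(C_2 - \rho/\rho_0)$, and the function names `dowson_higginson_pressure` and `mock_md_pressure` are hypothetical. The parameter values are copied from the `properties` and `gp` sections of the input file.

```python
import numpy as np

# Parameters copied from the `properties` section of the input file.
P0, RHO0, C1, C2 = 101325.0, 877.7007, 3.5e10, 1.23


def dowson_higginson_pressure(rho):
    """Pressure from density, ASSUMING the common Dowson-Higginson form."""
    r = np.asarray(rho, dtype=float) / RHO0
    return P0 + C1 * (r - 1.0) / (C2 - r)


def mock_md_pressure(rho, obs_stddev=100.0, seed=0):
    """Hypothetical stand-in for an MD run: exact law plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    p_exact = dowson_higginson_pressure(rho)
    return p_exact + rng.normal(0.0, obs_stddev, size=np.shape(p_exact))


# Three densities within a relative half width of 1e-6 around rho0
rho_samples = RHO0 * (1.0 + np.array([-1.0e-6, 0.0, 1.0e-6]))
p_exact = dowson_higginson_pressure(rho_samples)
p_noisy = mock_md_pressure(rho_samples)

print("exact pressure:", p_exact)
print("noisy pressure:", p_noisy)
```

Under this assumed law, even a tiny relative density change produces a pressure change much larger than the noise level of 100 Pa, which reflects how stiff a nearly incompressible fluid is.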

---

Upon loading the problem, the sanitized configuration of the GP settings looks like this:
```
- gp:
  - press_gp                 : True
  - shear_gp                 : True
  - press:
    - atol                   : 1.0
    - rtol                   : 0.1
    - obs_stddev             : 100.0
    - fix_noise              : True
    - max_steps              : 5
    - pause_steps            : 100
    - active_learning        : True
    - active_dims            : [0, 3]
  - shear:
    - atol                   : 1.0
    - rtol                   : 0.1
    - obs_stddev             : 1.0
    - fix_noise              : True
    - max_steps              : 5
    - pause_steps            : 100
    - active_learning        : True
    - active_dims_x          : [0, 1, 3]
    - active_dims_y          : [0, 2, 3]
```

The first two entries `press_gp` and `shear_gp` indicate that both GP models are active.

In [None]:
print('Pressure GP: ', myProblem.pressure.is_gp_model)
print('Wall shear stress (xz) GP: ', myProblem.wall_stress_xz.is_gp_model)
print('Wall shear stress (yz) GP: ', myProblem.wall_stress_yz.is_gp_model)
print('Bulk viscous stress GP: ', myProblem.bulk_stress.is_gp_model)

Since we are looking at a one-dimensional problem, the wall shear stress in the $y$ direction is irrelevant and is therefore not replaced with a surrogate.
The gap-averaged viscous stress components (`bulk_stress`) are never replaced by a GP model, and are set to zero in this example (see remark above).

GP models are connected to a database, which is configured with the following settings:
```
- db:
  - dtool_path               : None
  - init_size                : 5
  - init_method              : lhc
  - init_width               : 1e-06
  - init_seed                : 0
```

Both models read from and write to the same database, which is initially empty:

In [None]:
# Both models are connected to the same database...
print(myProblem.pressure.database)
print(myProblem.wall_stress_xz.database)

# ...but the database is initially empty
print(myProblem.pressure.database.size)
print(myProblem.pressure.database.Xtrain)
print(myProblem.pressure.database.Ytrain)
print(myProblem.pressure.database.training_path)

The database configuration specifies that 5 (`init_size`) datapoints should be initialized via Latin hypercube sampling (`lhc`).
The bounds for the LHC sampling are determined automatically from the initial conditions, but we can modify the bounds for the density
individually with the `init_width` argument, which gives the half width of the density interval relative to the initial condition (the default is $\pm1\%$).
Here, it is quite small due to the nearly incompressible fluid. Before we can run a simulation, we have to initialize the database.
Usually, this is done automatically when we call `GaPFlow.problem.Problem.run()`. Here, since we want to run the simulation 'manually', we have to call `GaPFlow.problem.Problem._pre_run()` first:

In [None]:
myProblem._pre_run()
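
As an aside, the Latin hypercube sampling used for the initial database can be sketched in a few lines of NumPy. This is an illustration of the technique, not GaPFlow's code: the helper `latin_hypercube_unit` is hypothetical, we only sample two features, and the gap height bounds are made-up values. The density bounds follow our reading of `init_width` as a relative half width around the initial density.

```python
import numpy as np


def latin_hypercube_unit(n, d, seed=0):
    """Hypothetical helper: n Latin hypercube points in the unit cube [0, 1)^d."""
    rng = np.random.default_rng(seed)
    # One random point inside each of the n equal-width strata of [0, 1)
    unit = (np.arange(n)[:, None] + rng.random((n, d))) / n
    # Shuffle the strata independently for every dimension
    for j in range(d):
        unit[:, j] = rng.permutation(unit[:, j])
    return unit


rho0 = 877.7007      # initial density from the input file
init_width = 1.0e-6  # relative half width of the density interval
init_size = 5        # number of initial samples

# Bounds for (density, gap height); the gap bounds are ASSUMED for illustration.
lower = np.array([rho0 * (1.0 - init_width), 1.0e-6])
upper = np.array([rho0 * (1.0 + init_width), 3.0e-6])

unit = latin_hypercube_unit(init_size, d=2)
samples = lower + unit * (upper - lower)
print(samples)
```

Each feature range is divided into `init_size` equally wide strata, and every stratum contains exactly one sample. This spreads a small number of points more evenly than independent uniform draws would.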

The output of `_pre_run()` tells us that five *mock* MD simulations "ran" with the inputs as specified.
Let's check the size of the database again.

In [None]:
print('Database size: ', myProblem.pressure.database.size)
print('Database features: ', myProblem.pressure.database.num_features)
print('X (shape)', myProblem.pressure.database.Xtrain.shape)
print('Y (shape)', myProblem.pressure.database.Ytrain.shape)
print('Y error (shape)', myProblem.pressure.database.Ytrain_err.shape)

We see that the database has been filled with the requested five datapoints. Each point is determined by seven features:

- the density $\rho$
- the flux in x direction $j_x$
- the flux in y direction $j_y$
- the gap height $h$
- the x gradient of the gap $\partial h/\partial x$
- the y gradient of the gap $\partial h/\partial y$
- an extra feature (by default zero)

We can append an arbitrary number of extra features. By default there is just one, which is unused.
You can check [the example on wall slip](../../examples/slip_1d_lj_mock.py) to see how the extra arguments can be used in a simulation.
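
The feature layout listed above can be pictured as the columns of a training matrix, from which each GP model picks a subset. The sketch below uses random placeholder numbers rather than the actual database content and assumes one row per datapoint; the name `FEATURES` is hypothetical, while the index lists are the ones shown in the sanitized `gp` configuration.

```python
import numpy as np

# Column order of the seven features, as listed above
FEATURES = ["rho", "jx", "jy", "h", "dh_dx", "dh_dy", "extra"]

rng = np.random.default_rng(0)
# Placeholder matrix, NOT real training data: 5 datapoints x 7 features
X = rng.random((5, len(FEATURES)))

# Index lists taken from the sanitized GP configuration
active_dims_press = [0, 3]
active_dims_shear_x = [0, 1, 3]

# Each model only sees the columns it is configured to use
X_press = X[:, active_dims_press]
X_shear_x = X[:, active_dims_shear_x]

press_names = [FEATURES[i] for i in active_dims_press]
shear_x_names = [FEATURES[i] for i in active_dims_shear_x]
print("pressure model inputs:", press_names, X_press.shape)
print("shear (xz) model inputs:", shear_x_names, X_shear_x.shape)
```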

We also see that there are 13 outputs (or observations) in total stored within the database.
The outputs are stored in the following order:

- $p$
- $\textcolor{grey}{\tau_{xx}^\mathrm{bot}}$
- $\textcolor{grey}{\tau_{yy}^\mathrm{bot}}$
- $\textcolor{grey}{\tau_{zz}^\mathrm{bot}}$
- $\tau_{yz}^\mathrm{bot}$
- $\tau_{xz}^\mathrm{bot}$
- $\textcolor{grey}{\tau_{xy}^\mathrm{bot}}$
- $\textcolor{grey}{\tau_{xx}^\mathrm{top}}$
- $\textcolor{grey}{\tau_{yy}^\mathrm{top}}$
- $\textcolor{grey}{\tau_{zz}^\mathrm{top}}$
- $\tau_{yz}^\mathrm{top}$
- $\tau_{xz}^\mathrm{top}$
- $\textcolor{grey}{\tau_{xy}^\mathrm{top}}$

As mentioned above, most of the slots (in grey) are not used, because these stress components are not measured and are therefore stored as zero.
The individual training data of a single "MD" run is also stored in a local `dtool` dataset:

In [None]:
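import os
training_path = myProblem.pressure.database.training_path
for f in sorted(os.listdir(training_path)):
    # Join with the directory first, otherwise abspath resolves against the cwd
    print(os.path.abspath(os.path.join(training_path, f)))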

The training data is stored as metadata in `README.yml`, which can be loaded directly into a list of dicts:

In [None]:
readme_list = myProblem.pressure.database.get_readme_list_local()
readme_list[0]['Y']
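
To read such an output vector more easily, the 13 slots can be labelled according to the ordering given above. The helper below is a hypothetical convenience function, not part of GaPFlow, and it assumes that the outputs of one run are available as a flat sequence of 13 numbers. We demonstrate it on a dummy vector.

```python
# Output ordering as listed above (bottom wall first, then top wall)
OUTPUT_NAMES = [
    "p",
    "tau_xx_bot", "tau_yy_bot", "tau_zz_bot",
    "tau_yz_bot", "tau_xz_bot", "tau_xy_bot",
    "tau_xx_top", "tau_yy_top", "tau_zz_top",
    "tau_yz_top", "tau_xz_top", "tau_xy_top",
]

# Slots that are actually measured; the others are stored as zero
USED = ("p", "tau_yz_bot", "tau_xz_bot", "tau_yz_top", "tau_xz_top")


def label_outputs(y):
    """Hypothetical helper: map a flat 13-entry output vector to names."""
    values = [float(v) for v in y]
    if len(values) != len(OUTPUT_NAMES):
        raise ValueError(f"expected {len(OUTPUT_NAMES)} outputs, got {len(values)}")
    labelled = dict(zip(OUTPUT_NAMES, values))
    return {name: labelled[name] for name in USED}


# Dummy vector whose entries equal their own index, for demonstration only
dummy_y = list(range(13))
used_outputs = label_outputs(dummy_y)
print(used_outputs)
```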

The `_pre_run()` initialization also runs a first hyperparameter fit of the GP models.
The output data used in the pressure and shear stress models is fixed, but we can change which features will be used via the `active_dims` settings.
The defaults are density and gap height (`active_dims: [0, 3]`) for pressure, and density, gap height, and flux for shear stress (e.g. `active_dims_x: [0, 1, 3]`).
The models are now ready to make a first prediction:

In [None]:
press_predict = myProblem.pressure.predict(predictor=False)
shear_predict = myProblem.wall_stress_xz.predict(predictor=False)
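
Conceptually, a GP prediction consists of a posterior mean and a posterior variance. The following NumPy sketch shows the textbook computation for a zero-mean GP with a squared-exponential kernel and a fixed observation noise, mirroring the idea behind `fix_noise` and `obs_stddev`. It is not GaPFlow's implementation: the kernel choice, the hyperparameter values, the toy data, and the function names are all assumptions made for illustration.

```python
import numpy as np


def rbf_kernel(xa, xb, variance, lengthscale):
    """Squared-exponential kernel between two sets of 1D inputs."""
    sqdist = (xa[:, None] - xb[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)


def gp_predict(x_train, y_train, x_test, variance, lengthscale, obs_stddev):
    """Posterior mean and variance of a zero-mean GP with fixed noise."""
    K = rbf_kernel(x_train, x_train, variance, lengthscale)
    K += obs_stddev**2 * np.eye(x_train.size)  # fixed (not fitted) noise
    Ks = rbf_kernel(x_train, x_test, variance, lengthscale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    v = np.linalg.solve(L, Ks)
    mean = Ks.T @ alpha
    var = variance - np.sum(v**2, axis=0)  # variance of the latent function
    return mean, var


# Toy data: five noisy observations of a sine curve
rng = np.random.default_rng(0)
obs_stddev = 0.05
x_train = np.linspace(0.0, 1.0, 5)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, obs_stddev, x_train.size)

x_test = np.linspace(0.0, 1.0, 50)
mean, var = gp_predict(x_train, y_train, x_test,
                       variance=1.0, lengthscale=0.2, obs_stddev=obs_stddev)
print("largest predictive std:", np.sqrt(var.max()))
```

The variance is small close to the training points and grows in between them, which is exactly the signal that active learning uses to decide where new data is needed.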

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from GaPFlow.viz.plotting import _plot_sol_from_field_1d

_sx, _sy = plt.rcParams['figure.figsize']
fig, ax = plt.subplots(2, 3, figsize=(2*_sx, 2*_sy))

_plot_sol_from_field_1d(myProblem.q,
                        press_predict[0],
                        shear_predict[0][0],
                        shear_predict[0][1],
                        press_predict[1],
                        shear_predict[1][0],
                        myProblem.pressure.variance_tol,
                        myProblem.wall_stress_xz.variance_tol,
                        ax=ax)

The uncertainty tolerance is controlled via the `atol` and `rtol` parameters.
Here, we see that the pressure prediction is very good (which is expected, as pressure only depends on density and the density is constant in the beginning).
However, the shear stress prediction exceeds the uncertainty tolerance, so active learning is needed to augment the database and thereby improve the fit. Let's run the simulation for a few steps and see how the predictions evolve.

In [None]:
# This is usually what happens within the run() method but we can also trigger it manually
for _ in range(50):
    myProblem.update()

In [None]:
print('Simulation step: ', myProblem.step)
print('Database size: ', myProblem.pressure.database.size)
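
The growth of the database is the result of an active learning loop. A generic version of such a loop can be sketched as follows: predict, compare the largest predictive standard deviation with a tolerance, and if it is exceeded, request a new "MD" sample at the most uncertain location. This is a toy illustration under stated assumptions, not GaPFlow's algorithm. In particular, the tolerance formula `max(atol, rtol * range of the mean)`, the fixed kernel hyperparameters, and the functions `posterior` and `mock_md` are hypothetical.

```python
import numpy as np


def kernel(xa, xb, lengthscale=0.2):
    """Squared-exponential kernel with unit variance (fixed hyperparameters)."""
    return np.exp(-0.5 * (xa[:, None] - xb[None, :]) ** 2 / lengthscale**2)


def posterior(x_train, y_train, x_test, obs_stddev):
    """Posterior mean and standard deviation of a zero-mean GP."""
    K = kernel(x_train, x_train) + obs_stddev**2 * np.eye(x_train.size)
    Ks = kernel(x_train, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))


def mock_md(x, obs_stddev, rng):
    """Hypothetical stand-in for an expensive MD run: sine law plus noise."""
    return np.sin(2 * np.pi * x) + rng.normal(0.0, obs_stddev, np.shape(x))


rng = np.random.default_rng(0)
atol, rtol, obs_stddev, max_steps = 0.05, 0.1, 0.01, 5  # toy values
x_grid = np.linspace(0.0, 1.0, 101)
x_train = np.array([0.45, 0.55])
y_train = mock_md(x_train, obs_stddev, rng)

history = []  # (database size, largest predictive std) per iteration
for _ in range(max_steps):
    mean, std = posterior(x_train, y_train, x_grid, obs_stddev)
    tol = max(atol, rtol * (mean.max() - mean.min()))  # ASSUMED tolerance rule
    history.append((x_train.size, std.max()))
    if std.max() <= tol:
        break  # prediction is trusted, no new data needed
    x_new = x_grid[np.argmax(std)]  # most uncertain location
    x_train = np.append(x_train, x_new)
    y_train = np.append(y_train, mock_md(x_new, obs_stddev, rng))

for size, max_std in history:
    print(f"database size {size}: max std {max_std:.3f}")
```

Capping the number of additions per call, here with `max_steps`, keeps the cost of a single update bounded even when the tolerance cannot be reached immediately.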

Let's plot the current prediction again:

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(2*_sx, 2*_sy))
myProblem.plot(ax=ax)

Looks good. We now continue the simulation up to the 2500 steps specified in the input (`max_it`).

In [None]:
myProblem.run(keep_open=True)

The active learning simulation continued to extend the database. Let's check its growth as a function of the simulation time step:

In [None]:
plt.plot(myProblem.pressure.history['step'],
         myProblem.pressure.history['database_size'])

plt.xlabel('Step')
plt.ylabel('Database size');

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(2*_sx, 2*_sy))
myProblem.plot(ax=ax)

After 2500 steps, the simulation result looks close to what we expect for the journal bearing case, but it has not yet converged. Feel free to run it for longer by increasing the `max_it` parameter.