# Gaussian Process Regression with BayeSpace Using Simulated Data

This notebook demonstrates the use of **Gaussian Process Regression (GPR)** with BayeSpace on a series of simulated datasets, increasing in complexity. The goal is to showcase how BayeSpace can model complex functions and spatial patterns using GPR — a flexible, non-parametric method that infers function values and uncertainty without assuming a specific model form.

We will walk through the following examples:

- A simple **line** (1D regression)
- A **curve** (non-linear polynomial)
- A **plane** (2D regression)
- A **non-linear** 2D function

For each example, we generate noisy data, define a kernel structure, and train a Gaussian Process model using BayeSpace. We then visualise the predicted function values and uncertainties over the domain.

This notebook serves as both a tutorial and a showcase of BayeSpace’s GPR functionality across increasingly complex regression problems — from smooth trends to sharp non-linearities.


### Import Libraries

In [None]:
from app.regression_toolbox.model import Model, add_model, delete_model

from app.visualisation_toolbox.domain import Domain
from app.visualisation_toolbox.visualiser import GPVisualiser

from app.data_processing.sim_data_processor import SimDataProcessor

from app.gaussian_process_toolbox.kernel import Kernel
from app.gaussian_process_toolbox.gaussian_processor import GP
from app.gaussian_process_toolbox.transformation import Transformation

import os
import jax

os.chdir('/PhD_project/')
jax.config.update("jax_enable_x64", True)


## Example 1: Gaussian Process Regression on a Line

In this first example, we use BayeSpace to perform **Gaussian Process Regression (GPR)** on simulated data generated from a simple line:  
$$
f(x) = ax + b
$$

The true values used to generate the data are $ a = 1 $, $ b = 1 $, with Gaussian noise of standard deviation 1 added to simulate measurement uncertainty.
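
The data-generation step can be sketched without BayeSpace. This is a standard-library illustration, not the `SimDataProcessor` implementation: the helper name `simulate_line` and the seed are hypothetical, and the grid is assumed to include both endpoints.

```python
import random

def simulate_line(a=1.0, b=1.0, x_min=0.0, x_max=100.0, n_points=50,
                  noise_sd=1.0, seed=0):
    """Return (x, y_true, y_noisy) for f(x) = a*x + b with Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    step = (x_max - x_min) / (n_points - 1)
    xs = [x_min + i * step for i in range(n_points)]  # evenly spaced inputs
    y_true = [a * x + b for x in xs]                  # noise-free line
    y_noisy = [y + rng.gauss(0.0, noise_sd) for y in y_true]
    return xs, y_true, y_noisy

xs, y_true, y_noisy = simulate_line()
```

Each noisy observation differs from the true line by a draw from a zero-mean Gaussian with standard deviation 1, which is small compared with the range of the line over this domain (1 to 101).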

Unlike Bayesian Regression, which infers explicit parameter values for $ a $, $ b $, and $ \sigma $, GPR directly models the function $ f(x) $ as a distribution over possible functions, conditioned on the observed data. We use a **Matérn kernel** defined over the 1D input space $ x $, with the smoothness fixed at $ \nu = 2.5 $ and the length scale optimised during training within the bounds $ (0.001, 100) $.
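
For reference, the Matérn kernel with $ \nu = 2.5 $ has a simple closed form. The sketch below writes it out for scalar inputs with unit signal variance. The function name `matern52` is hypothetical, and whether BayeSpace applies an additional variance scaling is not stated in this notebook.

```python
import math

def matern52(x1, x2, length_scale=1.0):
    """Matérn kernel with nu = 5/2 and unit variance for scalar inputs."""
    r = abs(x1 - x2) / length_scale   # distance in units of the length scale
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

k_same = matern52(0.0, 0.0)                     # correlation of a point with itself
k_near = matern52(0.0, 1.0, length_scale=1.0)   # one length scale apart
k_long = matern52(0.0, 1.0, length_scale=10.0)  # same distance, longer length scale
```

A longer length scale keeps distant points more strongly correlated, which is why the optimiser is free to choose a large value for data that follow a straight line.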

We visualise the GP’s predictive mean and uncertainty across the domain, demonstrating how GPR captures both the trend and confidence of the inferred function — even for a simple linear case.
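
To make "predictive mean and uncertainty" concrete, here is a minimal standard-library sketch of GP prediction. It assumes a zero prior mean, a unit-variance Matérn 5/2 kernel and a fixed length scale. All function names are hypothetical, and BayeSpace's own training (which also optimises the length scale) is not reproduced here.

```python
import math

def matern52(x1, x2, length_scale):
    s = math.sqrt(5.0) * abs(x1 - x2) / length_scale
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

def cholesky(A):
    """Lower-triangular L with L L^T = A for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(x_train, y_train, x_test, length_scale=1.0, noise_sd=1.0):
    """Posterior mean and standard deviation of the latent function."""
    n = len(x_train)
    # Kernel matrix of the training inputs plus observation noise on the diagonal
    K = [[matern52(x_train[i], x_train[j], length_scale)
          + (noise_sd ** 2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y_train))  # (K + s^2 I)^-1 y
    means, sds = [], []
    for xs in x_test:
        k_star = [matern52(xs, xt, length_scale) for xt in x_train]
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        v = solve_lower(L, k_star)
        var = matern52(xs, xs, length_scale) - sum(vi * vi for vi in v)
        sds.append(math.sqrt(max(var, 0.0)))
    return means, sds

x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0, 4.0]
means, sds = gp_predict(x_train, y_train, [1.5, 10.0],
                        length_scale=2.0, noise_sd=0.1)
```

Inside the data (`x = 1.5`) the mean follows the trend and the standard deviation is small. Far from the data (`x = 10`) the mean falls back towards the zero prior and the standard deviation returns to roughly the prior value of 1.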

This example serves as a baseline for understanding BayeSpace’s GPR capability on well-behaved, 1D data.


In [2]:
# Add this line if model doesn't exist yet
# add_model('line', 'a*x + b', ['x'], 'y', ['a', 'b'])

# Define the true model for simulation: a line with a = 1, b = 1
sim_model = Model('line').add_fixed_model_param('a', 1).add_fixed_model_param('b', 1)

# Define the input domain for the simulation: 50 points from 0 to 100
sim_domain = Domain(1, 'linear').add_domain_param('min', 0).add_domain_param('max', 100).add_domain_param('n_points', 50)
sim_domain.build_domain()

# Generate noisy data using the model and domain, with Gaussian noise
sim_data_processor = SimDataProcessor('linear_example', sim_model, sim_domain, noise_dist='gaussian', noise_level=1)

# Matérn kernel (labelled 'x') acting on input dimension 0
kernel_config = {('matern', 'x'): [0]}

# Instantiate the kernel with hyperparameters
kernel_obj = Kernel(kernel_config)
kernel_obj.add_kernel_param('matern', 'x', 'length_scale', 1)           # Initial guess
kernel_obj.add_kernel_param('matern', 'x', 'nu', 2.5)                              # Smoothness
kernel_obj.add_kernel_param('matern', 'x', 'length_scale_bounds', (0.001, 100))     # Bounds for optimisation

# Identity transformation: the data are modelled in their original (untransformed) space
transformation = Transformation('identity')

# Initialise the GP model using the simulated data processor
gp = GP(sim_data_processor, kernel_obj, transformation=transformation, uncertainty_method='constant', uncertainty_params={'constant_error':1})

# Train the GP — fit hyperparameters and compute posterior
gp_model = gp.train()

# Create the visualiser for the trained GP
visualiser = GPVisualiser(gp)

# Visualise predicted line with posterior uncertainty
vis_domain = Domain(1, 'linear').add_domain_param('min', 0).add_domain_param('max', 100).add_domain_param('n_points', 100)
vis_domain.build_domain()
visualiser.show_predictions(vis_domain, 'predictions', '1D')

Data loaded from /PhD_project/data/processed_sim_data/linear_example
Plot saved at: /PhD_project/data/processed_sim_data/linear_example
Loading existing GP model from /PhD_project/results/gaussian_process_results/linear_example/instance_1/gaussian_process_model.pkl


## Example 2: Gaussian Process Regression on a Polynomial Curve

In this example, we use BayeSpace’s Gaussian Process Regression (GPR) capabilities to model the same second-degree polynomial relationship used in the Bayesian regression example:

$$
f(x) = ax^2 + bx + c
$$

Instead of explicitly parameterising the polynomial, we treat the function as an unknown process and place a **Matérn kernel** over the input space to infer its structure non-parametrically. The simulated dataset remains the same as in the Bayesian regression case, with true values $ a = 1.8 $, $ b = 2.8 $, and $ c = 1.4 $, and Gaussian noise with standard deviation $ \sigma = 1 $.

The Matérn kernel used has a smoothness parameter $ \nu = 2.5 $, allowing for flexible yet relatively smooth functions. Hyperparameters such as the length scale are inferred during training, while a constant observational error of 1 is assumed.
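
The role of the constant observational error can be read off the standard GP regression equations. These are the textbook results for a zero-mean prior, not a description of BayeSpace internals, and they assume that `constant_error` is the noise standard deviation $ \sigma_n $, which the notebook does not state explicitly. For a new input $ x_* $:

$$
\mu_* = k_*^\top \left(K + \sigma_n^2 I\right)^{-1} y, \qquad
\sigma_*^2 = k(x_*, x_*) - k_*^\top \left(K + \sigma_n^2 I\right)^{-1} k_*
$$

Here $ K $ is the kernel matrix of the training inputs, $ k_* $ is the vector of kernel values between $ x_* $ and the training inputs, and $ y $ is the vector of observations. A larger $ \sigma_n $ makes the fit trust individual points less, so the mean is smoother and the predictive variance is larger.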

This example highlights the flexibility of GPR to model complex functional forms without requiring an explicit equation, making it especially useful when the underlying structure is unknown or difficult to express analytically.


In [3]:
# Add this line if the model doesn't exist yet
# add_model('polynomial', 'a*x**2 + b*x + c', ['x'], 'y', ['a', 'b', 'c'])

# Step 1: Define the true model and generate synthetic data
sim_model = Model('polynomial').add_fixed_model_param('a', 1.8).add_fixed_model_param('b', 2.8).add_fixed_model_param('c', 1.4)

sim_domain = Domain(1, 'linear').add_domain_param('min', -3).add_domain_param('max', 3).add_domain_param('n_points', 100)
sim_domain.build_domain()

sim_data_processor = SimDataProcessor('polynomial_example', sim_model, sim_domain, noise_dist='gaussian', noise_level=1)

# Step 2: Define a Matern kernel for GPR
kernel_config = {('matern', 'x'): [0]}
kernel_obj = Kernel(kernel_config)
kernel_obj.add_kernel_param('matern', 'x', 'length_scale', 1)
kernel_obj.add_kernel_param('matern', 'x', 'nu', 2.5)
kernel_obj.add_kernel_param('matern', 'x', 'length_scale_bounds', (0.001, 100))

# Step 3: Apply identity transformation
transformation = Transformation('identity')

# Step 4: Fit the GP model
gp = GP(sim_data_processor, kernel_obj, transformation=transformation, uncertainty_method='constant', uncertainty_params={'constant_error': 1})
gp_model = gp.train()

# Step 5: Visualise predictions
visualiser = GPVisualiser(gp)

vis_domain = Domain(1, 'linear').add_domain_param('min', -3).add_domain_param('max', 3).add_domain_param('n_points', 100)
vis_domain.build_domain()

visualiser.show_predictions(vis_domain, 'predictions', '1D')


Data loaded from /PhD_project/data/processed_sim_data/polynomial_example
Plot saved at: /PhD_project/data/processed_sim_data/polynomial_example
Loading existing GP model from /PhD_project/results/gaussian_process_results/polynomial_example/instance_1/gaussian_process_model.pkl


## Example 3: Gaussian Process Regression on a Plane

In this example, we use BayeSpace to apply Gaussian Process Regression (GPR) on data simulated from a plane:

$$
f(x, y) = ax + by
$$

The data are generated on a 2D grid from $ -3 $ to $ 3 $ in both $ x $ and $ y $, using true parameter values $ a = 1 $ and $ b = 2 $, with added Gaussian noise of standard deviation $ \sigma = 1 $.
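
A rectangular grid of this kind can be sketched with the standard library. This is an illustration only: the helper names `linspace` and `simulate_plane` and the seed are hypothetical, and the grid is assumed to include both endpoints, with 20 points per axis as in the code cell for this example.

```python
import itertools
import random

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi, endpoints included."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def simulate_plane(a=1.0, b=2.0, lo=-3.0, hi=3.0, n=20, noise_sd=1.0, seed=0):
    """Rows of (x, y, C, C_true) for C_true = a*x + b*y on an n-by-n grid."""
    rng = random.Random(seed)
    axis = linspace(lo, hi, n)
    rows = []
    for x, y in itertools.product(axis, axis):  # every (x, y) combination
        c_true = a * x + b * y
        rows.append((x, y, c_true + rng.gauss(0.0, noise_sd), c_true))
    return rows

rows = simulate_plane()
```

With 20 points per axis the grid holds 400 observations, and the noise-free surface spans $ -9 $ to $ 9 $ between opposite corners.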

Instead of explicitly parameterising $ a $ and $ b $ as in Bayesian regression, we model the surface as a Gaussian Process with a **Matérn kernel** over both $ x $ and $ y $. This kernel accounts for spatial structure and smoothness in the function, allowing flexible, non-parametric modelling of the plane surface.
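
One common way to build a Matérn kernel over two inputs is to give each dimension its own length scale and apply the kernel to the scaled Euclidean distance. The sketch below assumes that this is what a list-valued `length_scale` means, which the notebook does not confirm. The function name `matern52_ard` is hypothetical.

```python
import math

def matern52_ard(p, q, length_scales):
    """Matérn 5/2 kernel on points p, q with one length scale per dimension."""
    # Scale each coordinate difference by its own length scale before
    # combining them into a single distance
    r = math.sqrt(sum(((pi - qi) / li) ** 2
                      for pi, qi, li in zip(p, q, length_scales)))
    s = math.sqrt(5.0) * r
    return (1.0 + s + s * s / 3.0) * math.exp(-s)

origin = (0.0, 0.0)
k_x_iso = matern52_ard(origin, (1.0, 0.0), [1.0, 1.0])
k_y_iso = matern52_ard(origin, (0.0, 1.0), [1.0, 1.0])
k_x_long = matern52_ard(origin, (1.0, 0.0), [10.0, 1.0])
```

With equal length scales a unit step in $ x $ or $ y $ gives the same correlation. Lengthening the scale in $ x $ makes the function vary more slowly along that axis only.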

After fitting the GP, we visualise the predictive surface with confidence intervals to assess the model’s performance. This example demonstrates BayeSpace’s ability to generalise beyond parametric forms and effectively model multivariate input domains with spatial correlation.
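
If the predictive distribution at each point is Gaussian, an interval follows directly from the predictive mean and standard deviation. The sketch below is a generic illustration with made-up numbers. The helper name `credible_band` is hypothetical, and the interval level used by `show_predictions` is not stated in this notebook.

```python
from statistics import NormalDist

def credible_band(means, sds, level=0.95):
    """Central interval (lower, upper) per point for Gaussian predictions."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% band
    return [(m - z * s, m + z * s) for m, s in zip(means, sds)]

# Illustrative values only, not output from the model
band = credible_band([0.0, 2.0], [1.0, 0.5])
```

The band is widest where the predictive standard deviation is largest, typically away from the training points or near the edge of the domain.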


In [4]:
# Add this line if model doesn't exist yet
# add_model('plane', 'a*x + b*y', ['x', 'y'], 'C', ['a', 'b'])

# Step 1: Define the true model and generate synthetic data
sim_model = Model('plane').add_fixed_model_param('a', 1).add_fixed_model_param('b', 2)

sim_domain = Domain(2, 'rectangular').add_domain_param('min_x', -3)\
                                     .add_domain_param('max_x', 3)\
                                     .add_domain_param('n_points_x', 20)\
                                     .add_domain_param('min_y', -3)\
                                     .add_domain_param('max_y', 3)\
                                     .add_domain_param('n_points_y', 20)
sim_domain.build_domain()

sim_data_processor = SimDataProcessor('plane_example', sim_model, sim_domain, noise_dist='gaussian', noise_level=1)

# Step 2: Define a Matern kernel in x and y
kernel_config = {('matern', 'xy'): [0, 1]}
kernel_obj = Kernel(kernel_config)
kernel_obj.add_kernel_param('matern', 'xy', 'length_scale', [1,1])
kernel_obj.add_kernel_param('matern', 'xy', 'nu', 2.5)
kernel_obj.add_kernel_param('matern', 'xy', 'length_scale_bounds', (0.001, 100))

# Step 3: Identity transformation for direct modelling
transformation = Transformation('identity')

# Step 4: Train the GP model
gp = GP(sim_data_processor, kernel_obj, transformation=transformation, uncertainty_method='constant', uncertainty_params={'constant_error': 1})
gp_model = gp.train()

# Step 5: Visualise the predictions
visualiser = GPVisualiser(gp)

vis_domain = Domain(2, 'rectangular').add_domain_param('min_x', -3)\
                                     .add_domain_param('max_x', 3)\
                                     .add_domain_param('n_points_x', 100)\
                                     .add_domain_param('min_y', -3)\
                                     .add_domain_param('max_y', 3)\
                                     .add_domain_param('n_points_y', 100)
vis_domain.build_domain()

visualiser.show_predictions(vis_domain, 'predictions', '2D')


Data loaded from /PhD_project/data/processed_sim_data/plane_example
Plot saved at: /PhD_project/data/processed_sim_data/plane_example
Loading existing GP model from /PhD_project/results/gaussian_process_results/plane_example/instance_1/gaussian_process_model.pkl


## Example 4: Gaussian Process Regression on a Non-Linear 2D Function

In this final example, we use BayeSpace to perform Gaussian Process Regression (GPR) on a complex, non-linear function:

$$
f(x, y) = \frac{\sin(x)}{y + a} + \frac{1}{b + x^2}
$$

The function is strongly non-linear, and the division by $ y + a $ would make it singular at $ y = -a $. The true parameters used to simulate the data are $ a = 2 $ and $ b = 3 $, so on the domain used here $ y + a \geq 2 $ and the function stays bounded. Gaussian noise of standard deviation 1 is added. We generate data on a grid over $ x, y \in [0, 10] $ using 40 points in each direction.
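
It is worth checking how large the true signal is relative to the noise. Since $ |\sin(x)/(y + a)| \leq 1/2 $ and $ 0 < 1/(b + x^2) \leq 1/3 $ on this domain, the noise-free surface lies between $ -0.5 $ and about $ 0.84 $. The short sketch below evaluates it on the same 40-by-40 grid, assuming the grid includes both endpoints.

```python
import math

def f(x, y, a=2.0, b=3.0):
    """Noise-free test function sin(x)/(y + a) + 1/(b + x^2)."""
    return math.sin(x) / (y + a) + 1.0 / (b + x * x)

axis = [10.0 * i / 39 for i in range(40)]        # 40 points on [0, 10]
values = [f(x, y) for x in axis for y in axis]   # full grid evaluation
f_min, f_max = min(values), max(values)
smallest_denominator = min(y + 2.0 for y in axis)
```

The whole surface spans less than 1.5 units, so noise with standard deviation 1 is comparable to the signal itself. The GP therefore has to average over many neighbouring points to recover the structure.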

To model this surface, we use a single **Matérn kernel** over both input dimensions, with a separate length scale for each, which provides flexibility and smoothness while remaining robust to sharp changes in curvature. GPR is well suited to this kind of problem, where the function is non-linear and potentially sensitive to small changes in input.

After training, we visualise the GP’s prediction surface along with uncertainty, highlighting BayeSpace’s capability to model noisy, sensitive systems using non-parametric methods.


In [5]:
# Add this line if model doesn't exist yet
# add_model('nonlinear_2D', 'sin(x)/(y+a) + 1/(b+x**2)', ['x', 'y'], 'C', ['a', 'b'])

# Step 1: Define the true model
sim_model = Model('nonlinear_2D').add_fixed_model_param('a', 2).add_fixed_model_param('b', 3)

# Step 2: Define 2D rectangular domain
sim_domain = Domain(2, 'rectangular')\
    .add_domain_param('min_x', 0)\
    .add_domain_param('max_x', 10)\
    .add_domain_param('min_y', 0)\
    .add_domain_param('max_y', 10)\
    .add_domain_param('n_points_x', 40)\
    .add_domain_param('n_points_y', 40)
sim_domain.build_domain()

# Step 3: Generate noisy data
sim_data_processor = SimDataProcessor('nonlinear_example', sim_model, sim_domain, noise_dist='gaussian', noise_level=1)

# Step 4: Define Matern kernel in x and y
kernel_config = {('matern', 'xy'): [0, 1]}
kernel_obj = Kernel(kernel_config)
kernel_obj.add_kernel_param('matern', 'xy', 'length_scale', [1,1])
kernel_obj.add_kernel_param('matern', 'xy', 'nu', 2.5)
kernel_obj.add_kernel_param('matern', 'xy', 'length_scale_bounds', (0.001, 100))

# Step 5: Use identity transformation
transformation = Transformation('identity')

# Step 6: Train GP model
gp = GP(sim_data_processor, kernel_obj, transformation=transformation, uncertainty_method='constant', uncertainty_params={'constant_error': 1})
gp_model = gp.train()

# Step 7: Create high-res prediction domain for plotting
vis_domain = Domain(2, 'rectangular')\
    .add_domain_param('min_x', 0)\
    .add_domain_param('max_x', 10)\
    .add_domain_param('min_y', 0)\
    .add_domain_param('max_y', 10)\
    .add_domain_param('n_points_x', 100)\
    .add_domain_param('n_points_y', 100)
vis_domain.build_domain()

# Step 8: Visualise predicted surface with uncertainty
visualiser = GPVisualiser(gp)
visualiser.show_predictions(vis_domain, 'predictions', '2D')


Data generated and saved to /PhD_project/data/processed_sim_data/nonlinear_example
Plot saved at: /PhD_project/data/processed_sim_data/nonlinear_example
Fitted new GP model and saving to /PhD_project/results/gaussian_process_results/nonlinear_example/instance_1/gaussian_process_model.pkl


In [None]:
# add_model('sin', 'A * sin(B * x + C) + D', ['x'], 'y', ['A', 'B', 'C', 'D'])

sim_model = Model('sin').add_fixed_model_param('A', 1).add_fixed_model_param('B', 1).add_fixed_model_param('C', 0).add_fixed_model_param('D', 0)
sim_domain = Domain(1, 'linear').add_domain_param('min', 0).add_domain_param('max', 10).add_domain_param('n_points', 50)
sim_domain.build_domain()
sim_data_processor = SimDataProcessor('gp_test_10', sim_model, sim_domain, noise_dist='gaussian', noise_percentage=0.5)
sim_data_processor.process_data()

Data loaded from /PhD_project/data/processed_sim_data/gp_test_10
Plot saved at: /PhD_project/data/processed_sim_data/gp_test_10


[    Unnamed: 0          x         y    y_true
 12          12   2.448980  0.729252  0.638550
 4            4   0.816327  0.835853  0.728635
 37          37   7.551020  0.798296  0.954457
 8            8   1.632653  0.858149  0.998087
 3            3   0.612245  0.531819  0.574706
 6            6   1.224490  0.771494  0.940633
 41          41   8.367347  0.948427  0.871097
 46          46   9.387755  0.036783  0.037014
 47          47   9.591837 -0.177167 -0.166283
 15          15   3.061224  0.080543  0.080282
 9            9   1.836735  1.070700  0.964846
 16          16   3.265306 -0.121329 -0.123398
 24          24   4.897959 -1.289284 -0.982831
 34          34   6.938776  0.658580  0.609627
 31          31   6.326531  0.043978  0.043332
 0            0   0.000000  0.000000  0.000000
 44          44   8.979592  0.450965  0.430626
 27          27   5.510204 -0.714968 -0.698272
 33          33   6.734694  0.431780  0.436323
 5            5   1.020408  0.459608  0.852322
 29          
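
The cells in this part of the notebook use two different noise arguments, `noise_level` and `noise_percentage`. In the output above, the row with `y_true = 0` also has `y = 0` exactly, which is what one would expect if `noise_percentage` scales the noise standard deviation by the magnitude of the true value. That reading is an inference from the printed output, not something the notebook states. The helper names below are hypothetical, and the sketch only contrasts the two assumed behaviours.

```python
import random

def add_level_noise(y_true, noise_level, seed=0):
    """Assumed meaning of noise_level: constant noise standard deviation."""
    rng = random.Random(seed)
    return [y + rng.gauss(0.0, noise_level) for y in y_true]

def add_percentage_noise(y_true, noise_percentage, seed=0):
    """Assumed meaning of noise_percentage: sd proportional to |y_true|."""
    rng = random.Random(seed)
    return [y + rng.gauss(0.0, noise_percentage * abs(y)) for y in y_true]

y_true = [0.0, 0.5, -1.0]
y_lvl = add_level_noise(y_true, 0.5)
y_pct = add_percentage_noise(y_true, 0.5)
```

Under this assumption, proportional noise leaves zero crossings untouched and perturbs the peaks most, whereas constant noise perturbs every point by the same typical amount.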

In [None]:
# add_model('plane', 'a*x + b*y +c', ['x', 'y'], 'C', ['a', 'b', 'c'])

sim_model = Model('plane').add_fixed_model_param('a', 1).add_fixed_model_param('b', 1).add_fixed_model_param('c', 1)
sim_domain = Domain(2, 'circular').add_domain_param('radius', 5).add_domain_param('mass', 5000)
sim_domain.build_domain()
sim_data_processor = SimDataProcessor('gp_test_11', sim_model, sim_domain, noise_dist='gaussian', noise_level=1)
sim_data_processor.process_data()

Data loaded from /PhD_project/data/processed_sim_data/gp_test_11
Plot saved at: /PhD_project/data/processed_sim_data/gp_test_11


[      Unnamed: 0         x         y         C    C_true
 3290        3290  0.514582  0.969710  2.913460  2.484293
 2333        2333 -0.738732 -3.480437 -2.121768 -3.219169
 4553        4553  2.394553 -4.131678 -0.282135 -0.737125
 3168        3168  0.389251 -2.395035 -1.621613 -1.005785
 2760        2760 -0.174741  3.249054  4.076328  4.074314
 ...          ...       ...       ...       ...       ...
 3772        3772  1.203905  4.117376  6.912167  6.321281
 5191        5191  3.459870  0.427009  5.277960  4.886880
 5226        5226  3.585202 -2.937736  1.945506  1.647465
 5390        5390  3.835865  1.295331  4.455971  6.131196
 860          860 -2.932032  1.946572  1.054759  0.014540
 
 [4613 rows x 5 columns],
       Unnamed: 0         x         y         C    C_true
 3262        3262  0.577248 -2.069415  0.260225 -0.492167
 4577        4577  2.394553 -1.526714 -0.356621  1.867839
 2213        2213 -0.926729  3.249054  3.164142  3.322325
 3226        3226  0.451916  3.900295  6.065

In [None]:
# add_model('3D_plane', 'a*x + b*y +c*z + d', ['x', 'y', 'z'], 'C', ['a', 'b', 'c', 'd'])

sim_model = Model('3D_plane').add_fixed_model_param('a', 1).add_fixed_model_param('b', 1).add_fixed_model_param('c', 1).add_fixed_model_param('d', 1)
sim_domain = Domain(3, 'cylindrical').add_domain_param('radius', 5).add_domain_param('mass', 5000).add_domain_param('height', 10)
sim_domain.build_domain()
sim_data_processor = SimDataProcessor('gp_test_12', sim_model, sim_domain, noise_dist='gaussian', noise_percentage=0.2)
sim_data_processor.process_data()

Data loaded from /PhD_project/data/processed_sim_data/gp_test_12
Plot saved at: /PhD_project/data/processed_sim_data/gp_test_12


[      Unnamed: 0         x         y         z         C     C_true
 1600        1600 -3.651099  1.541821 -2.663636 -4.275061  -3.772914
 90            90  0.665383 -0.093635 -5.000000 -3.403425  -3.428252
 5997        5997  3.632964 -2.663636  4.812731  6.854899   6.782060
 5642        5642 -2.571979  3.410912  3.878185  5.065299   5.717118
 3946        3946  1.744503  3.410912  1.074548  7.514325   7.229963
 ...          ...       ...       ...       ...       ...        ...
 3772        3772  1.204943 -1.261817  1.074548  1.976058   2.017674
 5191        5191  4.442305  1.074548  3.410912  7.543197   9.927764
 5226        5226  4.442305  2.009093  3.410912  4.211169  10.862310
 5390        5390 -2.302199 -3.598181  3.878185 -0.993752  -1.022195
 860          860 -2.032419 -3.130908 -3.598181 -9.237878  -7.761508
 
 [5204 rows x 6 columns],
       Unnamed: 0         x         y         z         C     C_true
 5638        5638  3.093404  2.943639  3.878185  9.687199  10.915228
 3153 

In [None]:
delete_model('exp_test')
add_model('exp_test', '(A * sin(x/a + omega * t) * cos(y/b + nu * t) + B * exp(-(x**2 + y**2)/c**2)) * exp(-alpha * t)', ['x', 'y', 't'], 'C', ['A', 'B', 'a', 'b', 'c', 'omega', 'nu', 'alpha'])

sim_model = Model('exp_test').add_fixed_model_param('A', 1).add_fixed_model_param('B', 1).add_fixed_model_param('a', 1).add_fixed_model_param('b', 1).add_fixed_model_param('c', 1).add_fixed_model_param('alpha', 1).add_fixed_model_param('omega', 1).add_fixed_model_param('nu', 1)
sim_domain = Domain(2, 'circular', dim_names=['x', 'y', 't'], time_array=[0, 1, 2]).add_domain_param('radius', 5).add_domain_param('mass', 2000)
sim_domain.build_domain()
sim_data_processor = SimDataProcessor('gp_test_13', sim_model, sim_domain, noise_dist='gaussian', noise_level=0.5)
sim_data_processor.process_data()

Data generated and saved to /PhD_project/data/processed_sim_data/gp_test_13
Plot saved at: /PhD_project/data/processed_sim_data/gp_test_13


[             x         y  t         C    C_true
 3196 -0.838506 -4.485149  1 -0.155656 -0.055696
 3071 -1.333922  3.409238  1  0.055979  0.035995
 6662  3.422071 -0.537955  2  0.075841 -0.011142
 5582 -0.640340 -0.023104  2 -0.248044  0.037490
 969  -0.541257 -1.224424  0  0.082949 -0.008311
 ...        ...       ... ..       ...       ...
 3772  1.242240 -4.656766  1 -0.200068 -0.250639
 5191 -2.027505  0.148513  2  0.117006  0.004203
 5226 -1.829338 -3.112212  2 -0.099339  0.010175
 5390 -1.135756 -3.627063  2 -0.344071 -0.005789
 860  -1.036673 -0.366338  0 -0.128389 -0.505075
 
 [5527 rows x 5 columns],
              x         y  t         C    C_true
 5665 -0.442174  4.267324  2  0.173385  0.135307
 994  -0.541257  3.066004  0  0.328946  0.513804
 6061  1.044074  2.722770  2 -0.454100  0.000164
 5711 -0.244007  2.207918  2  0.082971 -0.063321
 4094  2.233072  3.752472  1 -0.040791 -0.001347
 ...        ...       ... ..       ...       ...
 3918  1.539490  1.178216  1  0.236763 -0