## Case Study: Predicting Species Distributions with Gaussian Processes

In this case study, we apply **Gaussian Process Regression (GPR)** to model and predict **species presence** across space and time. This work was conducted in collaboration with **Xylo Systems**, a biodiversity intelligence platform focused on using AI to support conservation decision-making.

### Goal

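The aim is to estimate how likely a **particular species is to be observed** at any given location and time, using real-world species observation data. Because the model is fitted to transformed counts (see Data Transformation below), its output is best read as a relative observation intensity rather than a calibrated probability. This allows us to: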

- Fill in spatial and temporal gaps in datasets  
- Quantify uncertainty in under-sampled regions  
- Track species movement and habitat change over time  
- Generate high-resolution, dynamic biodiversity layers to support conservation

### Approach

We model species occurrence as a continuous spatio-temporal function $Z$ with a Gaussian process prior:
$$
Z(s, t) \sim \mathcal{GP}(\mu(s, t),\ k((s, t), (s', t')))
$$

Where:

- $ s \in \mathbb{R}^2 $ represents spatial coordinates (e.g., latitude and longitude)  
- $ t $ represents time (e.g., month or year)  
- $ \mu(s, t) $ is the mean function (typically zero)  
- $ k $ is a kernel encoding spatial and temporal correlation
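
One common way to build such a kernel is as a separable product of a spatial and a temporal component. The form below is a sketch under that assumption; a sum of kernels or a non-separable kernel would be equally valid choices. The symbols $\sigma_f^2$, $\ell_s$ and $\ell_t$ are introduced here for illustration only, and a single spatial length scale is used for simplicity. The Matérn ($\nu = 3/2$) and squared-exponential components mirror the kernel types configured in the code later in this notebook:
$$
k((s, t), (s', t')) = \sigma_f^2 \, k_s(s, s') \, k_t(t, t')
$$
$$
k_s(s, s') = \left(1 + \frac{\sqrt{3}\,\lVert s - s' \rVert}{\ell_s}\right) \exp\left(-\frac{\sqrt{3}\,\lVert s - s' \rVert}{\ell_s}\right), \qquad
k_t(t, t') = \exp\left(-\frac{(t - t')^2}{2\,\ell_t^2}\right)
$$

Under a product kernel, two observations are strongly correlated only when they are close in both space and time.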

#### Data Transformation

Since the original species data are **count-based** (i.e. integer values ≥ 0), we apply a **log1p transformation**:
$$
y = \log(1 + \text{count})
$$

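The transformation compresses the heavy right tail of the counts, stabilises variance, and maps zero counts to exactly zero, which improves numerical behaviour for regression. After transformation, we approximate the observation model with a **Gaussian likelihood**: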
$$
y_i \sim \mathcal{N}(Z(s_i, t_i),\ \sigma^2)
$$

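Here $\sigma^2$ is the observation-noise variance. Because the likelihood is Gaussian, the GP posterior is available in closed form, so this formulation models spatial and temporal structure while remaining tractable and interpretable.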
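
The transformation and its inverse can be illustrated with plain NumPy. This is a minimal sketch, independent of BayeSpace; the example counts and the predictive mean and standard deviation (`mean_y`, `sd_y`) are made-up values used only to show the mechanics.

```python
import numpy as np

# Hypothetical per-cell counts, including a zero count.
counts = np.array([0, 1, 4, 25, 300])

# Forward transform: y = log(1 + count). Zero counts map to exactly 0.
y = np.log1p(counts)

# Inverse transform: count = exp(y) - 1.
recovered = np.expm1(y)

# Back-transforming a Gaussian prediction made on the log1p scale.
# mean_y and sd_y are illustrative values, not model output.
mean_y, sd_y = 2.0, 0.5
median = np.expm1(mean_y)                            # median on the count scale
lower = max(np.expm1(mean_y - 1.96 * sd_y), 0.0)     # clip at zero counts
upper = np.expm1(mean_y + 1.96 * sd_y)

print(y.round(3), recovered)
print(f"median={median:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

Because the inverse transform is monotone but non-linear, the back-transformed mean corresponds to the median on the count scale, and a symmetric interval on the log1p scale becomes asymmetric in counts.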

### Implementation with BayeSpace

Using **BayeSpace**, we:

1. Aggregate and grid the species observation data in space and time  
2. Apply a log1p transformation to the counts  
3. Define a composite kernel:
   - Spatial kernel (e.g. Matérn or RBF) over latitude and longitude  
   - Temporal kernel (e.g. RBF or periodic) over time  
   - Combined into a spatio-temporal kernel over $(s, t) = (x, y, t)$
4. Train the GPR model to fit the transformed data  
5. Predict mean and uncertainty over the full domain
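
The five steps above can be sketched end to end with plain NumPy. This is not the BayeSpace implementation: the synthetic observations, the grid size, the product form of the kernel, the fixed hyperparameters (`ls_space`, `ls_time`, `variance`, `noise`) and all function names are assumptions made for illustration, and no hyperparameter optimisation is performed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: synthetic point observations (x, y, year), gridded into counts.
n_obs = 400
obs = np.column_stack([
    rng.normal(5.0, 1.5, n_obs),       # x coordinate
    rng.normal(5.0, 1.5, n_obs),       # y coordinate
    rng.integers(2015, 2020, n_obs),   # observation year (2015..2019)
])
edges = [np.linspace(0, 10, 6), np.linspace(0, 10, 6), np.arange(2015, 2021) - 0.5]
counts, _ = np.histogramdd(obs, bins=edges)   # shape (5, 5, 5); outliers dropped

# Inputs are the cell centres; one row (x, y, t) per grid cell.
centres = [0.5 * (e[:-1] + e[1:]) for e in edges]
gx, gy, gt = np.meshgrid(*centres, indexing="ij")
X = np.column_stack([gx.ravel(), gy.ravel(), gt.ravel()])

# Step 2: log1p-transform the counts.
y = np.log1p(counts.ravel())

# Step 3: composite kernel = Matern(nu=3/2) over (x, y) times RBF over t.
def matern32(a, b, length_scale):
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1) / length_scale
    return (1.0 + np.sqrt(3.0) * d) * np.exp(-np.sqrt(3.0) * d)

def rbf(a, b, length_scale):
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def kernel(A, B, ls_space=2.0, ls_time=2.0, variance=1.0):
    return variance * matern32(A[:, :2], B[:, :2], ls_space) * rbf(A[:, 2], B[:, 2], ls_time)

# Step 4: "train" a zero-mean GP with fixed hyperparameters via Cholesky.
noise = 0.1   # assumed observation-noise variance
L = np.linalg.cholesky(kernel(X, X) + noise * np.eye(len(X)))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# Step 5: posterior mean and standard deviation at new (x, y, t) points.
X_new = np.array([[5.0, 5.0, 2017.0],    # centre of the study area
                  [9.5, 0.5, 2017.0]])   # sparsely observed corner
K_s = kernel(X, X_new)
mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
var = np.diag(kernel(X_new, X_new)) - np.sum(v**2, axis=0)
std = np.sqrt(np.maximum(var, 0.0))

print("predicted log1p counts:", mean.round(2))
print("predictive std:", std.round(2))
```

In practice the length scales and noise variance would be learned by maximising the marginal likelihood rather than fixed by hand, which is the role of the training step in the list above.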

### Outcome

The result is a **continuous surface of predicted species presence** over space and time, visualised as:

- Spatial distribution maps at selected time points  
- Temporal trends at fixed locations  
- Uncertainty maps indicating model confidence

This information helps identify habitat hotspots, understand seasonal dynamics, and guide conservation priorities in a data-scarce environment.


### Import Libraries

In [None]:
from app.visualisation_toolbox.domain import Domain
from app.visualisation_toolbox.visualiser import GPVisualiser

from app.gaussian_process_toolbox.kernel import Kernel
from app.gaussian_process_toolbox.transformation import Transformation
from app.gaussian_process_toolbox.gaussian_processor import GP

from app.data_processing.raw_data_processor import RawDataProcessor

import numpy as np
import os
import jax

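# Work from the project root (presumably so that relative data paths resolve).
os.chdir('/PhD_project/')
# JAX defaults to 32-bit floats; enable 64-bit precision for the GP linear algebra.
jax.config.update("jax_enable_x64", True)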


In [None]:
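# Select the taxonomic class Aves (birds) and grid the observations into
# 10 x 10 spatial cells with yearly timesteps; counts go in the 'counts' column.
processor_params = {
    'identifier': 'aves',
    'tax_group': 'class',
    'num_x_cells': 10,
    'num_y_cells': 10,
    'timestep': 'year',
    'output_header': 'counts'
}

raw_data_processor = RawDataProcessor(
    'xylo_test_project', 'birds_in_melbourne', 'XYLO_processor', processor_params
)
processed_data, _ = raw_data_processor.process_data()

# Spatial bounding box and time range of the processed data,
# reused below to define the visualisation domain.
x_min = processed_data['x'].min()
x_max = processed_data['x'].max()
y_min = processed_data['y'].min()
y_max = processed_data['y'].max()
t_min = processed_data['t'].min()
t_max = processed_data['t'].max()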

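# Composite kernel: a Matérn kernel over input dimensions 0 and 1 (x, y)
# and an RBF kernel over input dimension 2 (t).
kernel_config = {('matern', 'xy'): [0, 1], ('rbf', 't'): [2]}
# Initial length scales and their bounds; nu=1.5 selects the Matérn-3/2 form.
kernel_obj = Kernel(kernel_config) \
    .add_kernel_param('matern', 'xy', 'length_scale', [1.5, 1.5]) \
    .add_kernel_param('matern', 'xy', 'nu', 1.5) \
    .add_kernel_param('matern', 'xy', 'length_scale_bounds', (0.1, 10)) \
    .add_kernel_param('rbf', 't', 'length_scale', 2.0) \
    .add_kernel_param('rbf', 't', 'length_scale_bounds', (0.1, 10))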

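# Parameters for the 'XYLO_uncertainty' observation-uncertainty method.
uncertainty_params = {
    'precision_error': 0.1,
    'zero_count_error': 1
}
# The GP is fitted to log1p-transformed counts.
gp = GP(raw_data_processor,
        kernel_obj,
        transformation=Transformation('log1p'),
        uncertainty_method='XYLO_uncertainty',
        uncertainty_params=uncertainty_params)

# Fit the model to the processed data.
gp_model = gp.train()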

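# Rectangular 30 x 30 spatial grid, evaluated at each timestep.
# Note: np.arange excludes its stop value, so t_max itself is not included;
# use np.arange(t_min, t_max + 1, 1) if the final timestep should be shown.
vis_domain = Domain(2, 'rectangular', time_array=np.arange(t_min, t_max, 1), dim_names=['x', 'y', 't']) \
    .add_domain_param('min_x', x_min).add_domain_param('max_x', x_max).add_domain_param('n_points_x', 30) \
    .add_domain_param('min_y', y_min).add_domain_param('max_y', y_max).add_domain_param('n_points_y', 30)

vis_domain.build_domain()

# Plot the predicted surface for each timestep in the domain.
visualiser = GPVisualiser(gp)
visualiser.show_predictions(vis_domain, 'test', title='GP Test 2D Time', plot_type='2D_time')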
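
To make the size of the prediction domain concrete, the sketch below builds a comparable grid with plain NumPy. It does not reproduce the `Domain` class, whose internals are not shown here; the bounds are hypothetical placeholders for `x_min`, `x_max`, `y_min`, `y_max`, `t_min` and `t_max`.

```python
import numpy as np

# Hypothetical bounds standing in for the values derived from processed_data.
x_min, x_max = 0.0, 10.0
y_min, y_max = 0.0, 10.0
t_min, t_max = 2015, 2020

# 30 evenly spaced points per spatial axis, endpoints included.
xs = np.linspace(x_min, x_max, 30)
ys = np.linspace(y_min, y_max, 30)

# np.arange excludes the stop value; add 1 to include the final timestep.
ts_exclusive = np.arange(t_min, t_max, 1)       # 2015 .. 2019
ts_inclusive = np.arange(t_min, t_max + 1, 1)   # 2015 .. 2020

# One (x, y, t) row per prediction point.
gx, gy, gt = np.meshgrid(xs, ys, ts_exclusive, indexing="ij")
grid = np.column_stack([gx.ravel(), gy.ravel(), gt.ravel()])

print(grid.shape)   # 30 * 30 * 5 prediction points, 3 coordinates each
```

Prediction cost grows with the number of grid points, so the spatial resolution and the number of timesteps are worth choosing with the size of the training set in mind.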