# Global Daily SST Analysis: Identifying Marine Extremes with `MarEx-Detect`

### `MarEx-Detect` Processing Pipeline:

1. **Anomaly Generation**
   - Removes polynomial trends (user-configurable orders)
   - Eliminates seasonal cycle via annual and semi-annual harmonics
   - Optionally standardises by day-of-year temporal variability

2. **Extreme Event Identification**
   - Computes adaptive local thresholds using percentile-based approach
   - Creates boolean masks identifying extreme events
   - Uses histogram-based approximation for efficiency on large datasets

3. **Results Assembly**
   - Attaches spatial metadata (connectivity, cell areas) if provided
   - Optimises chunking for subsequent analyses
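
As a rough illustration of steps 1 and 2, the sketch below fits and removes a mean, polynomial trends, and annual plus semi-annual harmonics from a single synthetic time series by least squares, then flags values above that series' own 95th percentile. It is a minimal NumPy sketch under stated assumptions, not `MarEx-Detect`'s implementation: the function names, the synthetic data, and the single-cell setup are all hypothetical.

```python
import numpy as np

def design_matrix(t_years, detrend_orders=(1, 2)):
    """Columns: mean, polynomial trend terms, annual and semi-annual harmonics."""
    cols = [np.ones_like(t_years)]
    cols += [t_years ** k for k in detrend_orders]
    for cycles_per_year in (1, 2):
        omega = 2 * np.pi * cycles_per_year
        cols += [np.sin(omega * t_years), np.cos(omega * t_years)]
    return np.column_stack(cols)

def anomalies_and_extremes(series, t_years, percentile=95.0, detrend_orders=(1, 2)):
    """Remove the fitted trend and seasonal cycle, then threshold the residual."""
    A = design_matrix(t_years, detrend_orders)
    coeffs, *_ = np.linalg.lstsq(A, series, rcond=None)
    anomaly = series - A @ coeffs
    threshold = np.percentile(anomaly, percentile)  # one threshold for this cell
    return anomaly, threshold, anomaly > threshold

# Synthetic 10-year daily series for one grid cell (illustrative values only)
rng = np.random.default_rng(0)
t = np.arange(10 * 365) / 365.0
sst_cell = (15 + 0.02 * t + 3 * np.sin(2 * np.pi * t)
            + 0.5 * np.cos(4 * np.pi * t) + rng.normal(0, 0.3, t.size))

anomaly, threshold, is_extreme = anomalies_and_extremes(sst_cell, t)
print(f"threshold = {threshold:.3f}, fraction flagged = {is_extreme.mean():.3f}")
```

On a real grid the same fit would be applied independently to every cell, which is why the thresholds are local.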

The pipeline leverages **dask** for distributed parallel computation and **flox** for optimised groupby operations, enabling efficient processing of large datasets. \
A 40-year global daily analysis at 5 km resolution on the _unstructured grid_ (15 million cells) completes in ~10 minutes on 512 cores.

In [1]:
import xarray as xr
import numpy as np
import dask
import intake
from getpass import getuser
from pathlib import Path

import marEx
import marEx.helper as hpc

In [None]:
# Lustre Scratch Directory
scratch_dir = Path('/scratch') / getuser()[0] / getuser()

In [3]:
# Start Distributed Dask Cluster
client = hpc.start_distributed_cluster(n_workers=512, workers_per_node=64, runtime=29, node_memory=512,
                                 scratch_dir = scratch_dir / 'clients')  # Specify temporary scratch directory for dask to use

Dask Scratch: '/scratch/b/b382615/clients/tmpsvja5opp'
Memory per Worker: 8.00 GB
Hostname: l40204
Forward Port: l40204:8889
Dashboard Link: localhost:8889/status


In [4]:
# Import 40 years of Daily Native-Grid ICON data (ref. EERIE project)

cat = intake.open_catalog("https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml")
expid = 'eerie-control-1950'
version = 'v20240618'
model = 'icon-esm-er'
gridspec = 'native'

dat = cat['dkrz.disk.model-output'][model][expid][version]['ocean'][gridspec]

In [None]:
# Choose optimal chunk size & load data
#   N.B.: This is crucial for dask (not only for performance, but also to make the problem tractable)
#         The operations in this package eventually require global-in-time operations,
#         therefore, a larger time chunksize is beneficial.

time_chunksize = 200
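sst = (
    dat['2d_daily_mean'](chunks={}).to_dask()
    .to                                                   # `to` is the temperature variable in this dataset (not a method call)
    .isel(depth=0)                                        # Keep only the first depth level
    .drop_vars({'depth', 'cell_sea_land_mask'})
    .chunk({'time': time_chunksize, 'ncells': 'auto'})
)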

In [6]:
# Ensure that the chunks are appropriately-sized
#  N.B.: The intermediate chunk size is the global-in-time memory footprint
#        It is good for each worker to have a few times more memory than this value

chunk_shape = sst.data.chunksize
intermediate_chunk_shape = (sst.sizes['time'],) + chunk_shape[1:]

print(f"Data Chunking (time, ncells): {chunk_shape}")
print(f"Initial Chunk Size: {np.prod(chunk_shape) * sst.data.dtype.itemsize / (1024**2):.2f} MB")
print(f"Intermediate Chunk Size: {np.prod(intermediate_chunk_shape) * sst.data.dtype.itemsize / (1024**2):.2f} MB")

Data Chunking (time, ncells): (200, 167772)
Initial Chunk Size: 128.00 MB
Intermediate Chunk Size: 11687.67 MB
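
The numbers printed above can be reproduced by hand. The sketch below redoes the arithmetic for 4-byte (`float32`) values and the 18262 time steps of this dataset, and shows why `'auto'` lands on 167772 cells per chunk, assuming dask aims for its default target of 128 MiB per chunk (the `array.chunk-size` setting). The helper names are hypothetical, and dask's real auto-chunking logic is more general than this one-line division.

```python
import math

MIB = 1024 ** 2

def chunk_mib(shape, itemsize):
    """Memory footprint of one chunk in MiB."""
    return math.prod(shape) * itemsize / MIB

def ncells_chunk_for_target(time_chunk, itemsize, target_bytes=128 * MIB):
    """Largest ncells chunk that keeps one (time, ncells) chunk within the target size."""
    return target_bytes // (time_chunk * itemsize)

itemsize = 4        # bytes per float32 value
n_time = 18262      # total number of daily time steps in the dataset
time_chunk = 200

ncells_chunk = ncells_chunk_for_target(time_chunk, itemsize)
initial = chunk_mib((time_chunk, ncells_chunk), itemsize)
intermediate = chunk_mib((n_time, ncells_chunk), itemsize)

print(f"ncells per chunk: {ncells_chunk}")
print(f"Initial chunk: {initial:.2f} MiB")
print(f"Intermediate (global-in-time) chunk: {intermediate:.2f} MiB")
```

Because the intermediate footprint scales with the full time dimension, it does not depend on `time_chunksize`; only the `ncells` chunk width controls it.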


In [None]:
# Load the grid & neighbours

grid2d = dat['2d_grid'](chunks={}).to_dask().rename({'cell':'ncells'})
neighbours = grid2d.neighbor_cell_index.rename({'clat':'lat', 'clon':'lon'})
areas = grid2d.cell_area.rename({'clat':'lat', 'clon':'lon'})
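
To show why the neighbour table and cell areas are worth carrying along, the sketch below labels connected regions of a boolean mask on a toy unstructured mesh using a `(3, ncells)` neighbour array, then sums cell areas per region. It is only an illustration of the idea, not the tracking algorithm used by the package. The 0-based indices and the `-1` marker for a missing neighbour are assumptions made for this example; the indexing convention of the real grid file should be checked before reuse.

```python
from collections import deque
import numpy as np

def label_connected(mask, neighbours):
    """Breadth-first labelling of connected True cells; 0 marks background."""
    labels = np.zeros(mask.size, dtype=np.int32)
    current = 0
    for seed in np.flatnonzero(mask):
        if labels[seed]:
            continue  # already part of an earlier region
        current += 1
        labels[seed] = current
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            for nb in neighbours[:, cell]:
                if nb >= 0 and mask[nb] and labels[nb] == 0:
                    labels[nb] = current
                    queue.append(nb)
    return labels

# Toy mesh: 6 cells in a strip, each linked to its left/right neighbour
# (third neighbour slot unused, marked -1)
toy_neighbours = np.array([
    [-1,  0,  1,  2,  3,  4],
    [ 1,  2,  3,  4,  5, -1],
    [-1, -1, -1, -1, -1, -1],
])
toy_mask = np.array([True, True, False, True, False, True])
toy_areas = np.full(6, 2.0)  # illustrative cell areas

labels = label_connected(toy_mask, toy_neighbours)
region_area = np.bincount(labels, weights=toy_areas)[1:]  # skip background label 0
print(labels, region_area)
```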

In [None]:
# Process Data using `MarEx Detect` helper functions:

extremes_ds = marEx.preprocess_data(sst,
                                std_normalise = False,            # Don't standardise the anomalies (the usual choice)
                                threshold_percentile = 95,        # Use the 95th percentile as the extremes threshold
                                detrend_orders = [1, 2],          # Remove 1st- & 2nd-order polynomial trends (in addition to the mean & annual/semi-annual cycles)
                                exact_percentile = False,         # Estimate the percentile with a histogram-based method (accurate to within 0.025 °C)
                                dask_chunks = {'time': 2},        # Dask chunks for *output* data (much smaller than the input chunks because the subsequent tracking/ID is more memory-intensive)
                                neighbours = neighbours,          # Neighbour connectivity, used in subsequent processing
                                cell_areas = areas,               # Area of each unstructured-grid cell (in m²), used in subsequent processing
                                dimensions = {'time':'time',
                                              'xdim':'ncells'})   # Omitting 'ydim' tells MarEx-Detect that this is an unstructured grid
extremes_ds

Output (dask-backed arrays in the returned `xarray.Dataset`, in display order):

| Data type | Shape | Chunk shape | Total size | Chunk size | Dask chunks |
|---|---|---|---|---|---|
| float64 | (14886338,) | (14886338,) | 113.57 MiB | 113.57 MiB | 1 |
| float64 | (14886338,) | (14886338,) | 113.57 MiB | 113.57 MiB | 1 |
| float32 | (14886338,) | (14886338,) | 56.79 MiB | 56.79 MiB | 1 |
| float32 | (18262, 14886338) | (4, 14886338) | 0.99 TiB | 227.15 MiB | 4566 |
| bool | (18262, 14886338) | (4, 14886338) | 253.18 GiB | 56.79 MiB | 4566 |
| bool | (14886338,) | (14886338,) | 14.20 MiB | 14.20 MiB | 1 |
| int32 | (3, 14886338) | (3, 14886338) | 170.36 MiB | 170.36 MiB | 1 |

In [None]:
# Save data to `zarr` for more efficient parallel I/O

file_name = scratch_dir / 'mhws' / 'extremes_binary_unstruct.zarr'
extremes_ds.to_zarr(file_name, mode='w')