# Global Daily SST Analysis: Identifying Marine Extremes with `MarEx-Detect`

### `MarEx-Detect` Processing Pipeline:

1. **Anomaly Generation**
   - Removes polynomial trends (user-configurable orders)
   - Eliminates seasonal cycle via annual and semi-annual harmonics
   - Optionally standardises by day-of-year temporal variability

2. **Extreme Event Identification**
   - Computes adaptive local thresholds using a percentile-based approach
   - Creates boolean masks identifying extreme events
   - Uses a histogram-based approximation for efficiency on large datasets

3. **Results Assembly**
   - Attaches spatial metadata (connectivity, cell areas) if provided
   - Optimises chunking for subsequent analyses
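
To make steps 1 and 2 concrete, the sketch below applies them to a single synthetic time series (one grid cell) using plain NumPy. It is an illustration under stated assumptions, not the `MarEx-Detect` implementation: the function name `remove_trend_and_seasonal_cycle`, the joint least-squares fit, and the synthetic data are all hypothetical, and the package's own handling of thresholds may differ.

```python
import numpy as np


def remove_trend_and_seasonal_cycle(values, time_days, detrend_orders=(1, 2),
                                    period_days=365.25):
    """Remove mean, polynomial trends and annual + semi-annual harmonics.

    Hypothetical helper: all components are fitted jointly by least squares.
    """
    t = np.asarray(time_days, dtype=float)
    # Centre and scale time so the polynomial columns stay well conditioned
    t_scaled = (t - t.mean()) / (t.max() - t.min())

    columns = [np.ones_like(t)]                              # mean
    columns += [t_scaled**order for order in detrend_orders]  # polynomial trends
    for harmonic in (1, 2):                                  # annual, semi-annual
        omega = 2.0 * np.pi * harmonic / period_days
        columns += [np.sin(omega * t), np.cos(omega * t)]

    design = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coeffs


# Synthetic "SST" for one grid cell: mean + trend + seasonal cycle + noise
rng = np.random.default_rng(0)
time_days = np.arange(10 * 365)
noise = rng.normal(0.0, 0.3, time_days.size)
sst_like = (15.0
            + 0.0005 * time_days
            + 3.0 * np.sin(2.0 * np.pi * time_days / 365.25)
            + 0.5 * np.cos(4.0 * np.pi * time_days / 365.25)
            + noise)

anomalies = remove_trend_and_seasonal_cycle(sst_like, time_days)

# Local (per-grid-cell) percentile threshold and the resulting boolean mask
threshold = np.percentile(anomalies, 95)
is_extreme = anomalies > threshold
```

On a real dataset the same fit would be repeated independently for every grid cell, which is why each cell's full time axis must be available to a single task.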

The pipeline leverages **dask** for distributed parallel computation and **flox** for optimised groupby operations, enabling efficient processing of large datasets. \
A 40-year global daily analysis at 0.25° resolution completes in ~2 minutes on 128 cores.
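
The histogram-based approximation mentioned in step 2 can be illustrated generically: rather than sorting every value to find a percentile, the values are binned once and the percentile is read off the cumulative histogram. The sketch below is a minimal NumPy version of that idea; the function name `histogram_percentile` and the choice of bin edges are assumptions for illustration and do not reflect the package's internals.

```python
import numpy as np


def histogram_percentile(values, percentile, bin_edges):
    """Approximate a percentile from a fixed-bin histogram.

    Hypothetical helper. Values outside `bin_edges` are ignored by
    np.histogram, so the edges should span the full data range.
    """
    counts, edges = np.histogram(values, bins=bin_edges)
    total = counts.sum()
    cdf = np.cumsum(counts) / total
    target = percentile / 100.0

    # First bin whose cumulative fraction reaches the target
    idx = int(np.searchsorted(cdf, target))
    cdf_below = cdf[idx - 1] if idx > 0 else 0.0

    # Linear interpolation inside the selected bin
    frac = (target - cdf_below) / (counts[idx] / total)
    return edges[idx] + frac * (edges[idx + 1] - edges[idx])


rng = np.random.default_rng(42)
values = rng.normal(0.0, 1.0, 200_000)
bin_edges = np.linspace(-6.0, 6.0, 241)   # bin width of 0.05

approx = histogram_percentile(values, 95, bin_edges)
exact = np.percentile(values, 95)
```

The error of the approximation is bounded by the bin width, so the bin resolution trades accuracy against memory. Histogram counts can also be summed across chunks, which suits chunked parallel computation.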

In [None]:
import xarray as xr
import numpy as np
import dask
import intake
from getpass import getuser
from pathlib import Path

import marEx
import marEx.helper as hpc

In [None]:
# Start Dask Cluster
client = hpc.start_local_cluster(n_workers=64, n_threads=2)

In [None]:
# Import 40 years of Daily ICON data (ref. EERIE project)

cat = intake.open_catalog("https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml")
expid = 'eerie-control-1950'
version = 'v20240618'
model = 'icon-esm-er'
gridspec = 'gr025'

dat = cat['dkrz.disk.model-output'][model][expid][version]['ocean'][gridspec]

In [None]:
# Choose optimal chunk size & load data
#   N.B.: This is crucial for dask (not only for performance, but also to make the problem tractable)
#         The operations in this package eventually require global-in-time operations,
#         therefore, a larger time chunksize is beneficial.

time_chunksize = 1000
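ds = dat['2d_daily_mean'](chunks={}).to_dask()
sst = (
    ds['to']                 # `to` is the ocean temperature variable in this dataset
    .isel(depth=0)           # Select the surface level
    .drop_vars('depth')
    .chunk({'time': time_chunksize, 'lat': 'auto', 'lon': 'auto'})
)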

In [None]:
# Ensure that the chunks are appropriately sized
#  N.B.: The intermediate chunk size is the memory footprint of one spatial chunk spanning the full time axis.
#        Each worker should have several times this much memory available.

chunk_shape = sst.data.chunksize
intermediate_chunk_shape = (sst.sizes['time'],) + chunk_shape[1:]

print(f"Data Chunking (time, lat, lon): {chunk_shape}")
print(f"Initial Chunk Size: {np.prod(chunk_shape) * sst.data.dtype.itemsize / (1024**2):.2f} MB")
print(f"Intermediate Chunk Size: {np.prod(intermediate_chunk_shape) * sst.data.dtype.itemsize / (1024**2):.2f} MB")

In [None]:
# Process Data using `MarEx-Detect` helper functions:

extremes_ds = marEx.preprocess_data(sst, 
                                std_normalise = False,        # Do not standardise the anomalies (the usual choice)
                                threshold_percentile = 95,    # Use the 95th percentile as the extremes threshold
                                detrend_orders = [1, 2],      # Detrend the data using 1st & 2nd order polynomials (in addition to removing the mean & seasonal cycle/sub-cycle)
                                dimensions = {'time':'time',
                                              'xdim':'lon',
                                              'ydim':'lat'},  # Define the dimensions of the data -- if 'ydim' exists, then MarEx-Detect knows this is a gridded dataset
                                dask_chunks = {'time': 25})   # Dask chunks for *output* data (this is much smaller than the input chunks because the Tracking/ID is more memory-intensive)
extremes_ds

In [None]:
# Save Extremes Data to `zarr` for more efficient parallel I/O

file_name = Path('/scratch') / getuser()[0] / getuser() / 'mhws' / 'extremes_binary_gridded.zarr'
extremes_ds.to_zarr(file_name, mode='w')