# Global Daily Event Analysis: Marine Heatwave ID & Tracking using `MarEx`

### `MarEx` Processing Pipeline for Gridded Datasets:

1. **Morphological Pre-Processing**
    - Performs binary morphological closing using `dask_image.ndmorph` to fill small spatial holes up to `R_fill` cells in radius 
    - Executes binary opening to remove isolated small features of order `R_fill`
    - Fills gaps in time to maintain event continuity for interruptions up to `T_fill` time steps
    - Filters out objects smaller than `area_filter_absolute` cells

2. **Blob Identification**
    - Labels spatially connected components using an efficient connected-component algorithm in `dask_image.ndmeasure`
    - Computes blob properties (area, centroid, boundaries)

3. **Temporal Tracking**
    - Identifies blob overlaps between consecutive time frames
    - Connects objects across time, applying the following criteria for splitting, merging, & persistence:
        - Connected objects must overlap by at least fraction `overlap_threshold` of the smaller area
        - Merged objects retain their original IDs; the merged (child) area is partitioned among the parents by assigning each cell to the parent of its _nearest-neighbour_ cell (or, alternatively, by centroid distance)

4. **Graph Reduction & Finalisation**
    - Constructs the complete temporal graph of object evolution through time
    - Resolves object connectivity graph using `scipy.sparse.csgraph.connected_components`
    - Creates globally unique IDs for each tracked extreme event
    - Maps objects into efficient ID-time space for convenient analysis
    - Computes comprehensive statistics about the lifecycle of each event
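
As a rough illustration of steps 1 and 2, the sketch below applies closing, opening, labelling, and an area filter to a single 2D field. It uses `scipy.ndimage` in place of `dask_image` so that it runs without a cluster, and it omits the temporal gap filling. The helper names `disk` and `preprocess_and_label` are hypothetical and are not part of the `MarEx` API.

```python
import numpy as np
from scipy import ndimage


def disk(radius: int) -> np.ndarray:
    """Boolean disk-shaped structuring element with the given radius in cells."""
    y, x = np.ogrid[-radius : radius + 1, -radius : radius + 1]
    return x**2 + y**2 <= radius**2


def preprocess_and_label(binary: np.ndarray, r_fill: int, min_area: int):
    """Close, open, label, then drop objects smaller than `min_area` cells."""
    selem = disk(r_fill)
    # Closing (dilation then erosion) fills holes of order r_fill
    closed = ndimage.binary_closing(binary, structure=selem)
    # Opening (erosion then dilation) removes isolated features of order r_fill
    opened = ndimage.binary_opening(closed, structure=selem)
    # Connected-component labelling (4-connectivity by default); 0 is background
    labels, n_objects = ndimage.label(opened)
    # Area of each object in cells, indexed by label - 1
    areas = np.bincount(labels.ravel(), minlength=n_objects + 1)[1:]
    # Keep only the labels whose area passes the absolute cutoff
    keep = np.flatnonzero(areas >= min_area) + 1
    filtered = np.where(np.isin(labels, keep), labels, 0)
    return filtered, areas


# Toy field: one large feature with a hole, one small feature, one stray cell
field = np.zeros((40, 40), dtype=bool)
field[10:30, 10:30] = True  # large feature
field[20, 20] = False  # one-cell hole inside the large feature
field[33:38, 2:7] = True  # small feature, below the area cutoff
field[3, 35] = True  # isolated single cell

filtered, areas = preprocess_and_label(field, r_fill=1, min_area=100)
print("object areas before filtering:", areas)
print("cells kept after filtering:", np.count_nonzero(filtered))
```

In this toy case the hole is filled by the closing, the stray cell is removed by the opening, and the small feature is dropped by the area filter.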

The pipeline leverages **dask** for distributed parallel computation, enabling efficient processing of large datasets. \
A 40-year global daily (OSTIA) analysis at 0.25° resolution on 32 cores takes approximately:
- Basic tracking (i.e. Scannell et al., which involves no merge/split criteria or independent tracking):  ~5 minutes
- Full split/merge thresholding & merge tracking:  ~40 minutes
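
The overlap criterion of step 3 and the graph reduction of step 4 can be pictured with the minimal sketch below. It assumes that the object labels are already unique across the two frames, and it leaves out the partitioning of merged areas. The function names `overlap_links` and `group_events` are hypothetical and do not reflect the `MarEx` internals. Only `scipy.sparse.csgraph.connected_components` is taken from the pipeline description.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def overlap_links(prev: np.ndarray, curr: np.ndarray, threshold: float) -> np.ndarray:
    """Return (prev_id, curr_id) pairs whose overlap is at least `threshold`
    of the smaller of the two object areas. Label 0 is background."""
    both = (prev > 0) & (curr > 0)
    # Count the overlapping cells for every (prev_id, curr_id) combination
    pairs, counts = np.unique(
        np.stack([prev[both], curr[both]], axis=1), axis=0, return_counts=True
    )
    area_prev = np.bincount(prev.ravel())
    area_curr = np.bincount(curr.ravel())
    smaller = np.minimum(area_prev[pairs[:, 0]], area_curr[pairs[:, 1]])
    return pairs[counts / smaller >= threshold]


def group_events(links: np.ndarray, n_objects: int) -> np.ndarray:
    """Give every object an event ID from the connected components of the link graph."""
    rows, cols = links[:, 0] - 1, links[:, 1] - 1
    graph = coo_matrix(
        (np.ones(len(links)), (rows, cols)), shape=(n_objects, n_objects)
    ).tocsr()
    _, event_id = connected_components(graph, directed=False)
    return event_id


# Two consecutive labelled frames with object IDs 1-2 and 3-4
frame0 = np.array(
    [[1, 1, 1, 1, 1, 0, 2, 2, 0, 0],
     [1, 1, 1, 1, 1, 0, 2, 2, 0, 0]]
)
frame1 = np.array(
    [[0, 0, 0, 0, 3, 3, 3, 3, 0, 0],
     [4, 4, 4, 0, 0, 3, 3, 3, 0, 0]]
)

links = overlap_links(frame0, frame1, threshold=0.25)
event_id = group_events(links, n_objects=4)
print("accepted links:", links.tolist())
print("event ID per object:", event_id.tolist())
```

Objects 1 and 3 share a single cell, which is below 25% of the smaller area, so that link is rejected and the four objects reduce to two separate events.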

In [1]:
from getpass import getuser
from pathlib import Path

import dask
import xarray as xr

import marEx
import marEx.helper as hpc

In [2]:
# Lustre Scratch Directory
scratch_dir = Path("/scratch") / getuser()[0] / getuser()

In [3]:
# Start Dask Cluster
client = hpc.start_local_cluster(
    n_workers=32, threads_per_worker=1, scratch_dir=scratch_dir / "clients"
)  # Specify temporary scratch directory for dask to use

Hostname: l40183
Forward Port: l40183:8787
Dashboard Link: localhost:8787/status


In [4]:
# Choose optimal chunk size & load data
#   N.B.: This is crucial for dask (not only for performance, but also to make the problem tractable).
#         The operations are ultimately global in space, so the spatial dimensions must be contiguous (unchunked).
#         The chunk size in time can be adjusted to suit the available system memory.

chunk_size = {"time": 25, "lat": -1, "lon": -1}
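
To judge whether a given time chunk fits in worker memory, the size of one chunk can be estimated as in the sketch below. The dimension sizes are taken from the tracker output shown further down (9282 × 720 × 1440), with the dimension order assumed to be time, lat, lon. The boolean input dtype is also an assumption, and `chunk_nbytes` is a hypothetical helper.

```python
import numpy as np


def chunk_nbytes(chunks: dict, dim_sizes: dict, dtype) -> int:
    """Bytes held by a single chunk; a chunk size of -1 spans the full dimension."""
    n_cells = 1
    for dim, size in chunks.items():
        n_cells *= dim_sizes[dim] if size == -1 else size
    return n_cells * np.dtype(dtype).itemsize


# Assumed (time, lat, lon) sizes for the 0.25-degree global daily grid
dim_sizes = {"time": 9282, "lat": 720, "lon": 1440}
chunk_size = {"time": 25, "lat": -1, "lon": -1}

for dtype in (bool, "int32"):
    mib = chunk_nbytes(chunk_size, dim_sizes, dtype) / 2**20
    print(f"{np.dtype(dtype).name}: {mib:.1f} MiB per chunk")
```

Each worker typically holds several chunks and intermediate copies at once, so the per-chunk figure should stay well below the memory available per worker.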

In [5]:
# Load Pre-processed Data (cf. `01_preprocess_extremes.ipynb`)

file_name = scratch_dir / "mhws" / "extremes_binary_gridded_shifting_hobday.zarr"
ds = xr.open_zarr(str(file_name), chunks=chunk_size)

In [6]:
# Run ID, Tracking, & Merging

tracker = marEx.tracker(
    ds.extreme_events,
    ds.mask.where(
        (ds.lat < 85) & (ds.lat > -90), other=False
    ),  # Modify Mask: Anisotropy of the lat/lon grid near the poles biases the ID & Tracking
    grid_resolution=0.25,  # Grid resolution in degrees, used to calculate the object areas on the globe
    area_filter_absolute=600,  # Remove objects smaller than 600 cells
    R_fill=12,  # Fill small holes with radius < 12 _cells_
    T_fill=4,  # Allow gaps of 4 days and still continue the event tracking with the same ID
    allow_merging=True,  # Allow extreme events to split/merge. Keeps track of merge events & unique IDs.
    overlap_threshold=0.25,  # Overlap threshold for merging events. If overlap > threshold, events merge, are partitioned, and are independently tracked
    nn_partitioning=True,  # Use new NN method to partition merged children areas. If False, reverts to old method of Di Sun et al. 2023.
    verbose=True
)

extreme_events_ds, merges_ds = tracker.run(return_merges=True)
extreme_events_ds

Tracking Statistics:
   Binary Hobday to Processed Area Fraction: 0.38775179242158936
   Total Object Area IDed (cells): 751003681.0
   Number of Initial Pre-Filtered Objects: 174850
   Number of Final Filtered Objects: 131687
   Area Cutoff Threshold (cells): 600
   Accepted Area Fraction: 0.9862115802332532
   Total Events Tracked: 9291
   Total Merging Events Recorded: 38017


Arrays in `extreme_events_ds` (in display order; all dask-backed `numpy.ndarray`):

| Shape | Chunk shape | Data type | Bytes | Chunk bytes | Chunks | Graph layers |
|---|---|---|---|---|---|---|
| (9282, 720, 1440) | (1, 720, 1440) | int32 | 35.85 GiB | 3.96 MiB | 9282 | 4 |
| (9282, 9290) | (1, 9290) | int32 | 328.94 MiB | 36.29 kiB | 9282 | 3 |
| (9282, 9290) | (1, 9290) | float32 | 328.94 MiB | 36.29 kiB | 9282 | 4 |
| (2, 9282, 9290) | (1, 1, 9290) | float32 | 657.88 MiB | 36.29 kiB | 18564 | 7 |
| (9282, 9290) | (1, 9290) | bool | 82.24 MiB | 9.07 kiB | 9282 | 3 |
| (9290,) | (9290,) | datetime64[ns] | 72.58 kiB | 72.58 kiB | 1 | 3 |
| (9290,) | (9290,) | datetime64[ns] | 72.58 kiB | 72.58 kiB | 1 | 3 |
| (9282, 9290, 15) | (1, 9290, 15) | int32 | 4.82 GiB | 544.34 kiB | 9282 | 4 |


In [7]:
merges_ds

In [8]:
# Save IDed/Tracked/Merged Events to `zarr` for more efficient parallel I/O

file_name = scratch_dir / "mhws" / "extreme_events_merged_gridded_shifting.zarr"
extreme_events_ds.to_zarr(file_name, mode="w")

<xarray.backends.zarr.ZarrStore at 0x154dd1fa01f0>

In [9]:
# Save Merges Dataset to netcdf
file_name = scratch_dir / "mhws" / "extreme_events_merged_gridded_shifting_merges.nc"
merges_ds.to_netcdf(file_name, mode="w")

### Run Basic Tracking for Comparison
N.B.: This is the current standard method used in the literature, which involves _No_ temporal gap filling, _No_ merging/splitting and _No_ independent event tracking.
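
One way to picture the basic method is as a single connected-component labelling of the whole (time, lat, lon) volume, sketched below with `scipy.ndimage.label`. Treating the basic method as plain 3D labelling is an assumption made for illustration and is not necessarily how `marEx.tracker` implements `allow_merging=False`.

```python
import numpy as np
from scipy import ndimage

# Toy binary field with dimensions (time, lat, lon)
events = np.zeros((3, 5, 7), dtype=bool)
events[0, 1:3, 0:2] = True  # feature A at t=0
events[0, 1:3, 5:7] = True  # feature B at t=0, spatially separate from A
events[1, 1:3, 0:7] = True  # at t=1 a single band overlaps both A and B
events[2, 1:3, 5:7] = True  # at t=2 only the eastern part remains

# Labelling each frame on its own sees two distinct objects at t=0
_, n_objects_t0 = ndimage.label(events[0])

# Labelling the full volume connects cells that touch in space or in time
structure = ndimage.generate_binary_structure(3, 1)
labels, n_events = ndimage.label(events, structure=structure)

print("objects at t=0 when labelled per frame:", n_objects_t0)
print("events when labelled in space and time:", n_events)
print("A and B share an ID at t=0:", labels[0, 1, 0] == labels[0, 1, 5])
```

Because A and B are joined at t=1, they receive the same ID at every time step, including t=0 before they touch.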

In [10]:
# Run Basic Tracking

tracker = marEx.tracker(
    ds.extreme_events,
    ds.mask.where(
        (ds.lat < 85) & (ds.lat > -90), other=False
    ),  # Modify Mask: Anisotropy of the lat/lon grid near the poles biases the ID & Tracking
    area_filter_absolute=600,  # Remove objects smaller than 600 cells
    R_fill=12,  # Fill small holes with radius < 12 _cells_
    T_fill=0,  # No temporal hole filling
    allow_merging=False,  # Do not allow extreme events to split/merge. All touching events adopt the same ID forever (after _and_ before (!)).
)

extreme_events_basic_ds = tracker.run()
extreme_events_basic_ds

Tracking Statistics:
   Binary Hobday to Processed Area Fraction: 0.4526543028305935
   Total Object Area IDed (cells): 645136521.0
   Number of Initial Pre-Filtered Objects: 187801
   Number of Final Filtered Objects: 136343
   Area Cutoff Threshold (cells): 600
   Accepted Area Fraction: 0.983396886315788
   Total Events Tracked: 13186


Arrays in `extreme_events_basic_ds` (dask-backed `numpy.ndarray`):

| Shape | Chunk shape | Data type | Bytes | Chunk bytes | Chunks | Graph layers |
|---|---|---|---|---|---|---|
| (9282, 720, 1440) | (1, 720, 1440) | int32 | 35.85 GiB | 3.96 MiB | 9282 | 4 |


In [11]:
# Save IDed Events to `zarr` for more efficient parallel I/O

file_name = scratch_dir / "mhws" / "extreme_events_basic_gridded_shifting.zarr"
extreme_events_basic_ds.to_zarr(file_name, mode="w")

<xarray.backends.zarr.ZarrStore at 0x154d41f83d00>