# Identify & Track Marine Heatwaves on an _Unstructured Grid_ using `spot_the_blOb`

## Processing Steps:
1. Fill spatial holes in the binary data, using `dask_image.ndmorph` -- up to `R_fill` cells in radius.
2. Fill gaps in time -- permitting up to `T_fill` missing time slices, while keeping the same blob ID.
3. Filter out small objects -- those with an area below the `area_filter_quartile` quantile of the object size distribution.
4. Identify objects in the binary data, using `dask_image.ndmeasure`.
5. Connect objects across time, applying the following criteria for splitting, merging, and persistence:
    - Connected Blobs must overlap by at least fraction `overlap_threshold` of the smaller blob.
    - Merged Blobs retain their original IDs; the child blob is partitioned by assigning each of its cells to the parent of its _nearest-neighbour_ cell.
6. Cluster and reduce the final object ID graph using `scipy.sparse.csgraph.connected_components`.
7. Map the tracked objects into ID-time space for convenient analysis.
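
The linking rule in step 5 and the ID clustering in step 6 can be illustrated on toy data. The following is a minimal pure-Python sketch, not the `spot_the_blOb` implementation: blobs are represented as sets of cell indices, the helper names (`overlap_fraction`, `link_blobs`, `cluster_ids`) are hypothetical, and a small union-find stands in for `scipy.sparse.csgraph.connected_components`. It assumes a link is made when the area-weighted overlap is at least `overlap_threshold` of the smaller blob.

```python
def overlap_fraction(cells_a, cells_b, cell_area):
    """Shared area divided by the area of the smaller of the two blobs."""
    area_a = sum(cell_area[c] for c in cells_a)
    area_b = sum(cell_area[c] for c in cells_b)
    shared = sum(cell_area[c] for c in cells_a & cells_b)
    return shared / min(area_a, area_b)


def link_blobs(prev_blobs, next_blobs, cell_area, overlap_threshold):
    """Return (prev_id, next_id) pairs whose overlap passes the threshold."""
    links = []
    for pid, pcells in prev_blobs.items():
        for nid, ncells in next_blobs.items():
            if overlap_fraction(pcells, ncells, cell_area) >= overlap_threshold:
                links.append((pid, nid))
    return links


def cluster_ids(ids, links):
    """Group linked IDs into connected components with a small union-find."""
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smallest ID becomes the root
    return {i: find(i) for i in ids}


# Toy data (hypothetical): six cells of unit area, two consecutive time steps.
cell_area = {c: 1.0 for c in range(6)}
prev_blobs = {1: {0, 1, 2}, 2: {4, 5}}
next_blobs = {3: {1, 2, 3}, 4: {5}}

links = link_blobs(prev_blobs, next_blobs, cell_area, overlap_threshold=0.5)
labels = cluster_ids([1, 2, 3, 4], links)
print(links)   # pairs of blobs connected across the two time steps
print(labels)  # each original ID mapped to the ID of its cluster
```

Here blob 3 shares two of three cells with blob 1 and blob 4 lies entirely inside blob 2, so both pairs are linked and the four per-time-step IDs reduce to two tracked objects.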

N.B.: Exploits parallelised `dask` operations with optimised chunking using `flox` for memory efficiency and speed. \
N.N.B.: Applied to 40 years of _daily_ outputs at 5 km resolution on an Unstructured Grid (15 million cells), this example takes the following on 32 cores:
- Full Split/Merge Thresholding & Merge Tracking:  ~40 minutes

In [None]:
import xarray as xr
import dask
from getpass import getuser
from pathlib import Path

import spot_the_blOb as blob
import spot_the_blOb.helper as hpc

In [None]:
# Start Dask Cluster
client = hpc.StartLocalCluster(n_workers=32, n_threads=1)


In [None]:
# Load Pre-processed Data (cf. `01_preprocess_extremes.ipynb`)

file_name = Path('/scratch') / getuser()[0] / getuser() / 'mhws' / 'extreme_events_binary_unstruct.zarr'
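chunk_size = {'time': 4, 'ncells': -1}  # 4 time steps per chunk; all cells in a single chunk
# Open lazily, keep only the first 64 time steps, then rechunk
ds = xr.open_zarr(str(file_name), chunks={}).isel(time=slice(0, 64)).chunk(chunk_size)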
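
As a rough guide to what this chunking implies for worker memory, the size of a single chunk can be estimated by hand. The sketch below assumes roughly 15 million cells (the figure quoted at the top of this notebook), 1 byte per element for the boolean input, and 4 bytes per element if a labelled ID field were stored as 32-bit integers. Both dtypes are assumptions rather than values read from the dataset, and `chunk_nbytes` is a hypothetical helper.

```python
def chunk_nbytes(n_time, n_cells, itemsize):
    """Bytes held by one chunk of shape (n_time, n_cells)."""
    return n_time * n_cells * itemsize


n_cells = 15_000_000  # assumed grid size, from the note at the top of the notebook

bool_chunk = chunk_nbytes(4, n_cells, itemsize=1)   # boolean extreme-event mask
int32_chunk = chunk_nbytes(4, n_cells, itemsize=4)  # assumed 32-bit ID field

print(f"{bool_chunk / 1e6:.0f} MB per boolean chunk")
print(f"{int32_chunk / 1e6:.0f} MB per int32 chunk")
```

Intermediate arrays created during tracking can be several times larger than a single input chunk, so these numbers are a lower bound on per-worker memory use.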

In [None]:
# Tracking Parameters

drop_area_quartile = 0.8  # Remove the smallest 80% of the identified blobs
hole_filling_radius = 32  # Fill small holes with radius < 32 elements, i.e. ~100 km
time_gap_fill = 2         # Allow gaps of 2 days and still continue the blob tracking with the same ID
allow_merging = True      # Allow blobs to split/merge. Keeps track of merge events & unique IDs.
overlap_threshold = 0.5   # Overlap threshold for merging blobs. If overlap < threshold, blobs keep independent IDs.
nn_partitioning = True    # Use new NN method to partition merged children blobs. If False, reverts to old method of Di Sun et al. 2023.
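
To make the effect of `drop_area_quartile` concrete, here is a minimal sketch of a quantile-based size filter on toy data. It is not the library's code: `area_threshold` and `filter_small` are hypothetical helpers, the blob areas are invented, and the cutoff is taken by rank without interpolation, which may differ from how `spot_the_blOb` computes it.

```python
def area_threshold(areas, quantile):
    """Area at the given quantile of the sorted sizes (rank-based, no interpolation)."""
    ranked = sorted(areas)
    index = min(int(quantile * len(ranked)), len(ranked) - 1)
    return ranked[index]


def filter_small(blob_areas, quantile):
    """Keep only blobs whose area is at or above the quantile cutoff."""
    cutoff = area_threshold(blob_areas.values(), quantile)
    return {bid: area for bid, area in blob_areas.items() if area >= cutoff}


# Toy areas (hypothetical), keyed by blob ID.
blob_areas = {1: 10.0, 2: 250.0, 3: 40.0, 4: 5.0, 5: 900.0,
              6: 70.0, 7: 15.0, 8: 30.0, 9: 120.0, 10: 60.0}

kept = filter_small(blob_areas, quantile=0.8)
print(kept)  # only the largest 20% of blobs survive
```

With ten blobs and a quantile of 0.8, the eight smallest are dropped and the two largest are kept.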

In [None]:
# SpOt & Track the Blobs & Merger Events

tracker = blob.Spotter(ds.extreme_events, ds.mask, R_fill=hole_filling_radius, T_fill=time_gap_fill, area_filter_quartile=drop_area_quartile,
                       allow_merging=allow_merging, overlap_threshold=overlap_threshold, nn_partitioning=nn_partitioning,
                       xdim='ncells',                 # Need to tell spot_the_blOb the new Unstructured dimension
                       unstructured_grid=True,        # Use Unstructured Grid
                       neighbours=ds.neighbours,      # Connectivity array for the Unstructured Grid Cells
                       cell_areas=ds.cell_areas,      # Cell areas for each Unstructured Grid Cell
                       debug=0,                       # Choose Debugging Level (max=2)
                       verbosity=3)                   # Choose Verbosity Level (0=None, 1=Basic, 2=Timing)

blobs = tracker.run(return_merges=False)

blobs
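
With `nn_partitioning=True`, a child blob formed by a merge is split between its parents according to the nearest parent cell. One way to picture this on an unstructured grid is a multi-source breadth-first search over the cell connectivity (the role played by `neighbours` in the call above). The sketch below is an illustration under stated assumptions, not the library's algorithm: distance is measured in neighbour hops rather than geometric distance, ties go to whichever parent reaches a cell first, and `partition_child` is a hypothetical helper.

```python
from collections import deque


def partition_child(child_cells, parent_of_cell, neighbours):
    """Assign every child cell the ID of the nearest parent cell (in hops)."""
    # Seed with the child cells that directly overlap a parent.
    assigned = {c: parent_of_cell[c] for c in child_cells if c in parent_of_cell}
    queue = deque(sorted(assigned))
    while queue:
        cell = queue.popleft()
        for nb in neighbours[cell]:
            if nb in child_cells and nb not in assigned:
                assigned[nb] = assigned[cell]  # inherit the parent ID
                queue.append(nb)
    return assigned


# Toy grid (hypothetical): a chain of 7 cells, each linked to its left/right neighbour.
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}

child_cells = set(range(7))              # merged blob at the current time step
parent_of_cell = {0: 10, 1: 10, 6: 20}   # cells covered by parents 10 and 20 previously

assigned = partition_child(child_cells, parent_of_cell, neighbours)
print(assigned)
```

Cells 2 and 3 are closer to parent 10 and cells 4 and 5 are closer to parent 20, so the child is split between the two original IDs instead of being absorbed by one of them.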

In [None]:
# Trigger the lazy computation and load the tracked blobs into memory
blobs = blobs.compute()

In [None]:
# Save Tracked Blobs to `zarr` for more efficient parallel I/O

file_name = Path('/scratch') / getuser()[0] / getuser() / 'mhws' / 'MHWs_tracked_unstruct.zarr'
blobs.to_zarr(file_name, mode='w')
