# Pre-Process global _daily_ SST on _Native_ 5km Unstructured Grid using `MarEx Detect` to extract binary features

## Steps:
1. Compute Normalised Detrended Anomaly (cf. `detect.py::compute_normalised_anomaly()`)
2. Identify Values Locally Exceeding the Prescribed Percentile Threshold (here, above the 95th percentile) using a histogram-based parallel quantile calculation

N.B.: Exploits parallelised `Dask` operations with optimised chunking using `flox` \
N.N.B.: This example, which uses 40 years of daily output at 5km resolution on an _Unstructured Grid_ (15 million cells), takes ~10 minutes on 512 cores
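
As a rough illustration of step 1, the sketch below removes a least-squares fit from each cell's time series with plain NumPy. The fitted model (mean, linear trend, annual and semi-annual harmonics), the function name `detrended_anomaly` and the per-cell normalisation are assumptions made for this example; the model actually used by `compute_normalised_anomaly()` may differ.

```python
import numpy as np

def detrended_anomaly(sst, days, std_normalise=False):
    """Remove a least-squares fit from a (time, cell) array.

    Assumed model: mean + linear trend + annual and semi-annual harmonics.
    """
    t = np.asarray(days, dtype=float)
    omega = 2.0 * np.pi / 365.25
    # Design matrix: one column per fitted component
    design = np.column_stack([
        np.ones_like(t),
        t - t.mean(),
        np.sin(omega * t), np.cos(omega * t),
        np.sin(2 * omega * t), np.cos(2 * omega * t),
    ])
    coeffs, *_ = np.linalg.lstsq(design, sst, rcond=None)
    anomaly = sst - design @ coeffs
    if std_normalise:
        # Simplification: divide by the overall standard deviation of each cell
        anomaly = anomaly / anomaly.std(axis=0, keepdims=True)
    return anomaly

# Synthetic demo: 4 years of daily data in 3 cells (trend + seasonal cycle + noise)
rng = np.random.default_rng(0)
days = np.arange(4 * 365)
signal = 15.0 + 0.001 * days + 3.0 * np.sin(2 * np.pi * days / 365.25)
sst_demo = signal[:, None] + 0.5 * rng.standard_normal((days.size, 3))
anom_demo = detrended_anomaly(sst_demo, days)
```

What remains in `anom_demo` is mostly the injected noise, since the trend and seasonal cycle have been absorbed by the fit.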

In [None]:
import xarray as xr
import dask
import intake
from getpass import getuser
from pathlib import Path

import spot_the_blOb.hot_to_blOb as hot
import spot_the_blOb.helper as hpc

In [None]:
# Start Distributed Dask Cluster
client = hpc.StartDistributedCluster(n_workers=512, workers_per_node=64, runtime=20, node_memory=512)

In [None]:
# Import 40 years of Daily Native-Grid ICON data (ref. EERIE project)

cat = intake.open_catalog("https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml")
expid = 'eerie-control-1950'
version = 'v20240618'
model = 'icon-esm-er'
gridspec = 'native'

dat = cat['dkrz.disk.model-output'][model][expid][version]['ocean'][gridspec]

In [None]:
# Load the data directly into larger time chunks

# `to` is the ocean temperature variable in this dataset; its top depth level is used as SST
sst = dat['2d_daily_mean'](chunks={}).to_dask()['to'].isel(depth=0).chunk({'time':200,'ncells':'auto'})
sst
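
To see why the chunk shape matters at this resolution, the back-of-the-envelope sketch below estimates chunk sizes. It assumes 4-byte (`float32`) values, the 15 million cells quoted above, and Dask's default `array.chunk-size` target of 128 MiB; the helper `chunk_megabytes` is hypothetical, and the chunking Dask actually picks for `'auto'` may differ somewhat.

```python
def chunk_megabytes(n_time, n_cells, itemsize=4):
    """Size in MB of one chunk holding n_time steps of n_cells values."""
    return n_time * n_cells * itemsize / 1e6

N_CELLS = 15_000_000  # approximate number of cells quoted in the introduction

# 200 time steps spanning every cell would be far too large for a single chunk
all_cells_200_steps = chunk_megabytes(200, N_CELLS)

# 2 time steps spanning every cell is a manageable size
all_cells_2_steps = chunk_megabytes(2, N_CELLS)

# With 200 time steps per chunk, roughly how many cells fit into a 128 MiB target
# (assumed Dask default for 'auto' chunking)
auto_cells_per_chunk = (128 * 2**20) // (200 * 4)

print(all_cells_200_steps, all_cells_2_steps, auto_cells_per_chunk)
```

Under these assumptions, long time chunks are only feasible when each chunk covers a small part of the grid, whereas operations that need many cells at once call for very short time chunks.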

In [None]:
# Load the grid & neighbours

grid2d = dat['2d_grid'](chunks={}).to_dask().rename({'cell':'ncells'})
neighbours = grid2d.neighbor_cell_index.rename({'clat':'lat', 'clon':'lon'})
areas = grid2d.cell_area.rename({'clat':'lat', 'clon':'lon'})
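
On an unstructured grid there are no row and column indices, so spatial connectivity has to come from a neighbour table such as `neighbor_cell_index`. The sketch below shows one plausible use: labelling connected groups of flagged cells. It assumes a `(3, ncells)` table of 1-based cell indices where values of 0 or below mean "no neighbour"; the function `label_connected` and the toy grid are hypothetical and are not taken from the `MarEx` code.

```python
from collections import deque

import numpy as np

def label_connected(flags, neighbour_index):
    """Label connected groups of flagged cells using a (3, ncells) neighbour table."""
    labels = np.zeros(flags.size, dtype=int)  # 0 = not part of any feature
    current = 0
    for start in range(flags.size):
        if not flags[start] or labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:  # breadth-first search over neighbouring cells
            cell = queue.popleft()
            for nb in neighbour_index[:, cell]:
                nb0 = nb - 1  # assumed 1-based index -> 0-based
                if nb0 >= 0 and flags[nb0] and not labels[nb0]:
                    labels[nb0] = current
                    queue.append(nb0)
    return labels

# Toy strip of 6 cells: each cell touches the previous and the next one
neighbour_demo = np.array([
    [0, 1, 2, 3, 4, 5],  # previous cell (1-based; 0 = none)
    [2, 3, 4, 5, 6, 0],  # next cell
    [0, 0, 0, 0, 0, 0],  # third edge lies on the boundary
])
flags_demo = np.array([True, True, False, True, False, True])
labels_demo = label_connected(flags_demo, neighbour_demo)
```

In this toy case the first two flagged cells share a label, while the two isolated flagged cells each receive their own.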

In [None]:
# Process Data using `MarEx Detect` helper functions:

extremes_ds = hot.preprocess_data(sst, 
                                std_normalise = False,            # Don't standardise the anomalies by their standard deviation (the usual setting)
                                threshold_percentile = 95,        # Use the 95th percentile as the extremes threshold
                                exact_percentile = False,         # Use a histogram-based method to estimate the percentile value (within 0.025 °C)
                                dask_chunks = {'time': 2},        # Need to use smaller chunks in time to account for the larger amount of spatial data
                                neighbours = neighbours,          # Pass information about neighbours to be used in subsequent processing
                                cell_areas = areas,               # Pass the area of each Unstructured Grid cell (in m²) to be used in subsequent processing
                                dimensions = {'time':'time', 
                                              'xdim':'ncells'})   # Not specifying 'ydim' tells MarEx that it is an Unstructured Grid
extremes_ds
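
The idea behind `exact_percentile = False` can be sketched with NumPy: bin the anomalies into a fixed histogram, accumulate the counts, and read off the bin where the cumulative fraction first reaches the requested percentile. The bin width of 0.05 and the range of ±10 are assumptions chosen so that the half-width matches the 0.025 °C tolerance quoted above; the function `histogram_percentile` is hypothetical and the binning used by the library may differ.

```python
import numpy as np

def histogram_percentile(values, q, bin_width=0.05, lo=-10.0, hi=10.0):
    """Estimate the q-th percentile of each column of a (time, cell) array."""
    n_bins = int(round((hi - lo) / bin_width))
    edges = np.linspace(lo, hi, n_bins + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    estimates = np.empty(values.shape[1])
    for j in range(values.shape[1]):
        counts, _ = np.histogram(values[:, j], bins=edges)
        cdf = np.cumsum(counts) / counts.sum()
        # First bin whose cumulative fraction reaches q; report its centre
        estimates[j] = centres[np.searchsorted(cdf, q / 100.0)]
    return estimates

# Synthetic anomalies: 20000 time steps in 4 cells
rng = np.random.default_rng(1)
anomalies_demo = rng.standard_normal((20_000, 4))

thresholds_demo = histogram_percentile(anomalies_demo, 95)
exact_demo = np.percentile(anomalies_demo, 95, axis=0)

# Binary extremes: True where a value exceeds its cell's own threshold
extremes_demo = anomalies_demo > thresholds_demo
```

Because histogram counts from separate chunks can simply be added together, this approach avoids sorting each cell's full time series, which is what makes it attractive for parallel computation.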

In [None]:
# Save data to `zarr` for more efficient parallel I/O
file_name = Path('/scratch') / getuser()[0] / getuser() / 'mhws' / 'extremes_binary_unstruct.zarr'
extremes_ds.to_zarr(file_name, mode='w')