# Pre-Process global _daily_ SST on _Native_ 5km Unstructured Grid using `hot_to_blOb` to extract binary features

## Steps:
1. Compute Normalised Detrended Anomaly (cf. `hot_to_blOb.py::compute_normalised_anomaly()`)
2. Identify Extreme Values (i.e. above the 95th percentile) using a new histogram-based parallel quantile calculation

N.B.: Exploits parallelised `Dask` operations with optimised chunking using `flox` \
N.N.B.: This example, which uses 40 years of Daily outputs at 5km resolution on an Unstructured Grid (15 million cells), takes ~10 minutes on 512 cores
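
To make Step 1 concrete, here is a minimal single-point sketch of a detrended anomaly. It is not the code in `hot_to_blOb.py::compute_normalised_anomaly()`. It assumes the anomaly is the residual of a least-squares fit of a mean, a linear trend, and annual plus semi-annual harmonics, which may differ from what the library does. The function name `detrended_anomaly`, the `period` argument, and the synthetic series are hypothetical. Only the `std_normalise` name is borrowed from the call used later in this notebook.

```python
import numpy as np

def detrended_anomaly(series, period=365.25, std_normalise=False):
    """Remove mean, linear trend and seasonal harmonics from a daily series.

    Illustrative only: the model (mean + trend + annual + semi-annual
    harmonics) is an assumption, not necessarily the library's model.
    """
    series = np.asarray(series, dtype=float)
    t = np.arange(series.size, dtype=float)
    omega = 2.0 * np.pi / period
    # Design matrix: one column per fitted component
    design = np.column_stack([
        np.ones_like(t),                                # mean
        t,                                              # linear trend
        np.sin(omega * t), np.cos(omega * t),           # annual cycle
        np.sin(2 * omega * t), np.cos(2 * omega * t),   # semi-annual cycle
    ])
    coeffs, *_ = np.linalg.lstsq(design, series, rcond=None)
    anomaly = series - design @ coeffs
    if std_normalise:
        # Optionally scale the residual to unit standard deviation
        anomaly = anomaly / anomaly.std()
    return anomaly

# Synthetic 10-year daily "SST" at one cell: trend + seasonal cycle + noise
rng = np.random.default_rng(42)
days = np.arange(10 * 365)
sst_series = (15.0 + 0.0005 * days
              + 3.0 * np.sin(2 * np.pi * days / 365.25)
              + rng.normal(0.0, 0.5, days.size))

anomaly = detrended_anomaly(sst_series)
normalised = detrended_anomaly(sst_series, std_normalise=True)
```

On the real dataset the same fit would have to be applied independently at each of the ~15 million cells, which is why the notebook relies on parallelised `Dask` operations rather than a loop like this.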

In [1]:
import xarray as xr
import dask
import intake
import numpy as np
from getpass import getuser
from pathlib import Path

import spot_the_blOb.hot_to_blOb as hot
import spot_the_blOb.helper as hpc

In [2]:
# Start Distributed Dask Cluster
client = hpc.StartDistributedCluster(n_workers=512, workers_per_node=64, runtime=20, node_memory=512)

In [3]:
# Import 40 years of Daily EERIE ICON data

cat = intake.open_catalog("https://raw.githubusercontent.com/eerie-project/intake_catalogues/main/eerie.yaml")
expid = 'eerie-control-1950'
version = 'v20231106'
model = 'icon-esm-er'
gridspec = 'native'

dat = cat['dkrz.disk.model-output'][model][expid][version]['ocean'][gridspec]

In [4]:
# Load the data directly into slightly larger time chunks,
# then select the `to` (ocean temperature) variable at the surface level
sst = dat['2d_daily_mean'](chunks={'time':200,'ncells':'auto'}).to_dask().to.isel(depth=0)
sst

In [5]:
# Load the grid & neighbours
grid2d = dat['2d_grid'](chunks={}).to_dask().rename({'cell':'ncells'})
neighbours = grid2d.neighbor_cell_index.rename({'clat':'lat', 'clon':'lon'})

In [None]:
# Process Data using `hot_to_blOb` helper functions:

extreme_events_ds = hot.preprocess_data(sst, std_normalise=False, threshold_percentile=95,
                                        exact_percentile=False,                         # Use a histogram-based method to estimate the percentile value (within 0.025 °C)
                                        dask_chunks={'time':2},                         # Need to use smaller chunks in time to account for larger spatial dimension
                                        neighbours=neighbours,                          # Pass information about neighbours to be used in subsequent processing
                                        dimensions={'time':'time', 'xdim':'ncells'})    # Not specifying 'ydim' tells hot_to_blOb that it is an unstructured grid
extreme_events_ds
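
The `exact_percentile=False` option above trades a small loss of precision for a percentile estimate that parallelises well. The sketch below illustrates the general idea of a histogram-based percentile on a single in-memory array. It is an assumption-laden illustration, not the `hot_to_blOb` implementation: the function `histogram_percentile` and its `bin_width` argument are hypothetical, with the bin width chosen to match the 0.025 precision quoted in the comment above.

```python
import numpy as np

def histogram_percentile(values, q=95.0, bin_width=0.025):
    """Estimate the q-th percentile from a fixed-width histogram.

    Bin counts can be summed across chunks, so only the counts
    (not the sorted data) would need to be combined in parallel.
    """
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]          # ignore NaNs (e.g. land)
    vmin, vmax = values.min(), values.max()
    n_bins = max(1, int(np.ceil((vmax - vmin) / bin_width)))
    counts, edges = np.histogram(values, bins=n_bins, range=(vmin, vmax))
    # Cumulative fraction of samples up to the right edge of each bin
    cdf = np.cumsum(counts) / counts.sum()
    # First bin whose cumulative fraction reaches the requested quantile
    idx = np.searchsorted(cdf, q / 100.0)
    # Report the bin centre as the estimate
    return 0.5 * (edges[idx] + edges[idx + 1])

# Compare against the exact percentile on synthetic anomalies
rng = np.random.default_rng(0)
synthetic_anomaly = rng.normal(0.0, 1.0, size=100_000)
estimate = histogram_percentile(synthetic_anomaly, q=95.0)
exact = np.percentile(synthetic_anomaly, 95.0)
```

Since the estimate is a bin centre, its error is bounded by roughly half a bin width wherever the data are dense. That is the sense in which such a method is accurate "within" the chosen bin size.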

In [None]:
# Save data to `zarr` for more efficient parallel I/O
file_name = Path('/scratch') / getuser()[0] / getuser() / 'mhws' / 'extreme_events_binary_unstruct.zarr'
encoding = {var: {'compressor': None} for var in extreme_events_ds.data_vars}
extreme_events_ds.to_zarr(file_name, mode='w', encoding=encoding)