This notebook calculates marine heatwaves using the code in Eric Oliver's `marineHeatWaves` repository.

The code works, but it processes only a single pixel at a time, so a moderately sized block has to be handled in a manual loop. I wrote a loop to calculate many pixels, but it quickly becomes slow. The results of the benchmarking for MUR are below.

In [1]:
from datetime import datetime 

import fsspec
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import dask.array as da
import marineHeatWaves as mhw

In [None]:
import s3fs  # backend that fsspec uses for s3:// URLs

## Load & Subset MUR

In [None]:
# Block: LOAD ZARR (no task activity)
file_location = 's3://mur-sst/zarr'

ikey = fsspec.get_mapper(file_location, anon=True)

mur_full = xr.open_zarr(ikey, consolidated=True)
mur = mur_full['analysed_sst']

In [None]:
# Block: SUBSET
# 4 chunk subset, ~110 MB total
mur_subset = mur.sel(lat=slice(32, 32.4), lon=slice(123.0, 123.2))

In [None]:
mur_subset

## Oliver MHW code

### Preprocessing

In [None]:
# Format time values
time_dt_list = [datetime.strptime(str(time), '%Y-%m-%dT%H:%M:%S.000000000') for time in mur_subset.time.values]
time_ordinal = np.array([time.toordinal() for time in time_dt_list])
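
The string round-trip above depends on every timestamp printing with an all-zero fractional-second part. A minimal sketch of an alternative that stays in NumPy is shown here; the three-element `times` array is a made-up stand-in for `mur_subset.time.values`, not data from the notebook.

```python
import numpy as np
from datetime import date

# Hypothetical stand-in for mur_subset.time.values (datetime64[ns], one value per day)
times = np.array(
    ["2002-06-01T09:00:00", "2002-06-02T09:00:00", "2002-06-03T09:00:00"],
    dtype="datetime64[ns]",
)

# Truncate to whole days, then convert to datetime.date objects
days = times.astype("datetime64[D]").astype(object)

# Proleptic Gregorian ordinals, the time format passed to mhw.detect above
ordinals = np.array([d.toordinal() for d in days])
```

The result should match what the `strptime` version produces, since `toordinal()` ignores the time of day either way.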


In [None]:
# Extract sst as a numpy array
sst_np = mur_subset.values

### Calculating an individual pixel

Example of how to use this code on a single point.

In [None]:
mhws, clim = mhw.detect(time_ordinal, sst_np[:, 2, 6])

In [None]:
clim

## Calculating a block of data

In [None]:
# Get number of pixels in each dimension
size_t, size_lat, size_lon = sst_np.shape

# Create empty arrays to hold the outputs
full_climatology = np.empty(sst_np.shape)
full_threshold = np.empty(sst_np.shape)

# loop through each pixel in the sst array
# numpy indexes row, col starting from the upper left
for idx_lat in range(size_lat):
    for idx_lon in range(size_lon):
        # Calculate MHW stats for that pixel
        mhws, point_clim = mhw.detect(time_ordinal, sst_np[:, idx_lat, idx_lon])
        # Add the climatology and threshold to the output arrays
        full_climatology[:, idx_lat, idx_lon] = point_clim['seas']
        full_threshold[:, idx_lat, idx_lon] = point_clim['thresh']
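
Since every pixel is processed independently, the two nested loops can be collapsed into a single loop over a flattened pixel axis. The sketch below illustrates that layout only: `pixel_clim` is a hypothetical stand-in for `mhw.detect` (it returns a flat mean and 90th percentile, not a real climatology), and the random `sst` array replaces `sst_np` so the block runs without the MUR data.

```python
import numpy as np

def pixel_clim(t, temp):
    """Hypothetical stand-in for mhw.detect: returns 'seas' and 'thresh' arrays."""
    seas = np.full(temp.shape, np.nanmean(temp))
    thresh = np.full(temp.shape, np.nanpercentile(temp, 90))
    return {"seas": seas, "thresh": thresh}

rng = np.random.default_rng(0)
t = np.arange(730000, 730000 + 30)         # made-up ordinal days
sst = 290 + rng.normal(size=(30, 4, 3))    # made-up (time, lat, lon) block

# Collapse (lat, lon) into one pixel axis: shape (time, n_pixels)
flat = sst.reshape(sst.shape[0], -1)
seas_flat = np.empty_like(flat)
thresh_flat = np.empty_like(flat)

# One loop over pixels instead of two nested loops
for i in range(flat.shape[1]):
    clim = pixel_clim(t, flat[:, i])
    seas_flat[:, i] = clim["seas"]
    thresh_flat[:, i] = clim["thresh"]

# Restore the original (time, lat, lon) layout
seas = seas_flat.reshape(sst.shape)
thresh = thresh_flat.reshape(sst.shape)
```

This does not make the per-pixel call any faster, but a single pixel axis is easier to split into batches if the work is later spread across several workers.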


### Converting the data back to `xarray`

Use the same lat, lon, and time dimensions/coordinates as the original `mur_subset` data.

In [None]:
climatology = xr.DataArray(full_climatology, coords=mur_subset.coords, dims=mur_subset.dims)
threshold = xr.DataArray(full_threshold, coords=mur_subset.coords, dims=mur_subset.dims)


In [None]:
climatology

In [None]:
climatology.isel(time=0).plot()

In [None]:
# Exceedance above the MHW threshold (SST minus threshold, not the anomaly from the climatology)
(mur_subset - threshold).isel(time=0).plot()
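
The cell above plots SST minus the threshold, which is a different quantity from SST minus the seasonal climatology. A toy example with made-up numbers (three time steps at one pixel, not MUR values) shows the difference:

```python
import numpy as np

# Made-up values for three time steps at a single pixel
sst = np.array([18.0, 21.5, 23.0])
seas = np.array([19.0, 20.0, 20.5])      # seasonal climatology
thresh = np.array([20.5, 21.5, 22.0])    # percentile threshold

anomaly = sst - seas          # departure from the climatology
exceedance = sst - thresh     # departure from the threshold
above = exceedance > 0        # True where SST is above the threshold
```

Only the last time step is above the threshold here, even though the last two have a positive anomaly.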

# Extraneous other notes

### What is 'seas' returning?

`seas` looks like a single year of data (the seasonal climatology) repeated end to end until it matches the length of the input time array.

In [None]:
plt.plot(time_ordinal, point_clim['seas'])
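
One way such a repeating pattern can arise is sketched below, assuming the climatology is stored once per day of year and then indexed by each date's day of year. The array names and the sinusoidal cycle are made up for illustration and are not taken from `marineHeatWaves`; leap days are handled in the simplest possible way.

```python
import numpy as np
from datetime import date

# Hypothetical seasonal cycle: one value per day of year (366 slots)
seas_by_doy = 15 + 5 * np.sin(2 * np.pi * np.arange(366) / 366)

# Three years of daily ordinals, 2003-01-01 through 2005-12-31
ordinals = np.arange(date(2003, 1, 1).toordinal(), date(2005, 12, 31).toordinal() + 1)

# Day of year (1-based) for every date on the time axis
doy = np.array([date.fromordinal(int(o)).timetuple().tm_yday for o in ordinals])

# Expand the one-year cycle to the full time axis by indexing
seas_full = seas_by_doy[doy - 1]
```

Plotting `seas_full` against `ordinals` gives the same kind of repeated one-year curve as the plot above.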

## Some loose timing
A) `mur.sel(lat=slice(32, 32.4), lon=slice(123.0, 123.2))`
* `.values` extract ~20 seconds
* loop took ~90 seconds

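B) `mur.sel(lat=slice(32, 33), lon=slice(123.0, 124))` (33,187,893 values (`.size`); this is half the size of C even though the selector shown is identical to C's, so the selector recorded here is probably a copy-paste slip)
* `.values` extract ~4 seconds (unexpectedly shorter than A)
* loop took ~9.5 minutes (accidentally ran this with Dask on, but I don't think that makes a difference)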

C) `mur.sel(lat=slice(32, 33), lon=slice(123.0, 124))` (65,725,043 values (`.size`))
* `.values` extract ~4 seconds
* loop took ~19 minutes
* Notes: memory seems to stay at 0.2%; CPU does max out (100% of a worker)

D) Estimate for 3.5 x 2.25 degrees:
`mur.sel(lat=slice(32, 35.5), lon=slice(123.0, 125.25))` (511,097,418 values (`.size`))
* `.values` extract ~32 seconds (it did fit into memory :tada:)
* estimate for loop: about 8 times longer than C, so roughly 150 to 160 minutes
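
The loop times above can be turned into a rough per-pixel cost. The sketch below uses only the sizes and times recorded in runs B, C and D, plus one assumption: that run C is a 1 x 1 degree box on MUR's 0.01 degree grid with both slice endpoints included (101 x 101 pixels).

```python
# Array sizes (.size) and loop times copied from runs B, C and D above
size_b, loop_b_s = 33_187_893, 9.5 * 60
size_c, loop_c_s = 65_725_043, 19 * 60
size_d = 511_097_418

# Assumption: run C covers 101 x 101 pixels (0.01 degree grid, inclusive slices)
pixels_c = 101 * 101
n_time = size_c // pixels_c      # time steps per pixel

# Pixel counts implied by the other sizes
pixels_b = size_b // n_time
pixels_d = size_d // n_time

# Loop cost per pixel in each timed run
sec_per_pixel_b = loop_b_s / pixels_b
sec_per_pixel_c = loop_c_s / pixels_c

# Extrapolate run D from C's per-pixel cost
est_d_min = pixels_d * sec_per_pixel_c / 60

print(n_time, pixels_b, pixels_d)
print(round(sec_per_pixel_b, 3), round(sec_per_pixel_c, 3), round(est_d_min))
```

Under that assumption both timed runs come out at about 0.11 seconds per pixel, and the extrapolation for D lands near 148 minutes, in the same range as the estimate above.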