This notebook starts to implement the Oliver parallelized loop from the `ejoliver_loop_randomdata.ipynb` notebook with Geopolar SST data.

The first step of this process, opening the data, ended up requiring a lot of timing tests.

The timings recorded here are all from runs on the department server. Runs across different platforms are compared in an Excel sheet in the drive.

In [1]:
from datetime import datetime

import numpy as np
import xarray as xr
import marineHeatWaves as mhw

# AOI: Gulf Stream

## Size Stats
- `geopolar.analysed_sst.sel(lat=slice(32, 53), lon=slice(-79, -42))`
- Num pixels: 2157321600 (2.2 billion)
- Num bytes: 8.63 GB
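The byte count follows from the pixel count and the `float32` dtype reported in the array summaries further down. A quick check of that arithmetic, assuming decimal gigabytes (10^9 bytes):

```python
n_pixels = 2157321600          # geopolar.size for the Gulf Stream AOI
BYTES_PER_FLOAT32 = 4          # float32 stores each value in 4 bytes

total_bytes = n_pixels * BYTES_PER_FLOAT32
total_gb = total_bytes / 1e9   # decimal GB, not GiB
print(f"{total_gb:.2f} GB")
```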

# Data Processing & Access

## Accessing data with `zarr`

In [38]:
from datetime import datetime

import numpy as np
import dask.array as da

In [17]:
geopolar = da.from_zarr(filepath, component='/analysed_sst')
# time = da.from_zarr(filepath, component='/time')

In [22]:
time = time.compute()

Reading the `/time` component with dask straight from zarr does not give usable timestamps: the values come back as integer indexes, not datetimes.

In [29]:
time

array([   0,    1,    2, ..., 7138, 7139, 7140])
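A likely explanation is that `da.from_zarr` returns the integers exactly as stored, while `xarray` decodes them using the variable's CF `units` attribute. As a minimal sketch, the raw offsets can be decoded by hand with NumPy. The epoch and the day unit below are hypothetical placeholders and should be read from the store's `time` attributes rather than taken from here.

```python
import numpy as np

# Raw values as they come back from the zarr '/time' component
raw_time = np.array([0, 1, 2, 7138, 7139, 7140])

# ASSUMPTION: placeholder epoch and unit; read the real ones from the
# 'units' attribute of the time variable (e.g. "days since ...")
epoch = np.datetime64("2000-01-01")
decoded = epoch + raw_time.astype("timedelta64[D]")
print(decoded[:3])
```

Opening the store with `xr.open_zarr`, as done later in this notebook, avoids the manual step.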

In [36]:
time = geopolar.time.values

In [39]:
ordinal_time = format_time(time)

In [16]:
onedeg_sst = geopolar[:, 2600:2620, 2200:2220]

| | Array | Chunk |
|---|---|---|
| Bytes | 10.89 MiB | 3.12 kiB |
| Shape | (7134, 20, 20) | (2, 20, 20) |
| Count | 10702 Tasks | 3567 Chunks |
| Type | float32 | numpy.ndarray |


**STATUS:** Run this again on a smaller chunk. The mhw function raises an error when given the real data.

In [42]:
data = onedeg_sst

# Wrapper that takes the array first, as apply_along_axis requires,
# and returns only the seasonal climatology for one pixel
def func1d_climatology(arr, time):
    _, point_clim = mhw.detect(time, arr)
    return point_clim['seas']

# Same wrapper, but returning the MHW threshold for one pixel
def func1d_threshold(arr, time):
    _, point_clim = mhw.detect(time, arr)
    return point_clim['thresh']

# Output arrays (unused below: apply_along_axis returns new arrays)
full_climatology = da.zeros_like(data)
full_threshold = da.zeros_like(data)

# Apply each wrapper along the time axis (axis 0) of every pixel.
# mhw.detect expects time as ordinal day numbers, so pass ordinal_time.
climatology = da.apply_along_axis(func1d_climatology, 0, data, time=ordinal_time, dtype=data.dtype, shape=(7134,))
threshold = da.apply_along_axis(func1d_threshold, 0, data, time=ordinal_time, dtype=data.dtype, shape=(7134,))


In [None]:
climatology = climatology.compute()
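Because each wrapper calls `mhw.detect` separately, every pixel is processed twice. One way to avoid that, sketched here with NumPy's `np.apply_along_axis` and a hypothetical stand-in function `fake_detect` in place of `mhw.detect`, is to return both series stacked from a single call and split them afterwards. The stand-in only mimics the shape of the real outputs; its numbers are not a climatology, and the input data are random.

```python
import numpy as np

def fake_detect(t, temp):
    """Hypothetical stand-in for mhw.detect: returns (events, clim dict)."""
    seas = np.full(temp.shape, temp.mean())   # placeholder "climatology"
    thresh = seas + temp.std()                # placeholder "threshold"
    return {}, {"seas": seas, "thresh": thresh}

def func1d_both(arr, time):
    # One detect call per pixel; stack the two series into shape (2, T)
    _, point_clim = fake_detect(time, arr)
    return np.stack([point_clim["seas"], point_clim["thresh"]])

rng = np.random.default_rng(0)
time_ordinal = np.arange(730000, 730000 + 50)  # 50 made-up ordinal days
data = rng.normal(15.0, 2.0, size=(50, 3, 4))  # (time, lat, lon)

# The applied axis is replaced by the function's output shape,
# so the result has shape (2, time, lat, lon)
both = np.apply_along_axis(func1d_both, 0, data, time=time_ordinal)
climatology, threshold = both[0], both[1]
print(both.shape)
```

`da.apply_along_axis` follows the same calling convention but also takes `dtype` and `shape`, so with dask the `shape` argument would presumably become `(2, 7134)`. That has not been tested here.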

## Accessing data with `xarray`

In [2]:
filepath = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr'
geopolar = xr.open_zarr(filepath)
geopolar = geopolar.analysed_sst

In [9]:
min_lat, max_lat, min_lon, max_lon = (32, 53, -79, -42)
geopolar = geopolar.sel(lat=slice(min_lat, max_lat), lon=slice(min_lon, max_lon))
# geopolar = geopolar.sel(lat=slice(40, 41), lon=slice(-70, -69))

In [11]:
geopolar.size

2157321600

## Preprocessing

In [42]:
def format_time(time_np):
    # Convert numpy datetime64 values to ordinal day numbers, the time format mhw.detect expects
    time_dt_list = [datetime.strptime(str(time), '%Y-%m-%dT%H:%M:%S.000000000') for time in time_np]
    return np.array([time.toordinal() for time in time_dt_list])
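As an alternative to parsing strings, the same ordinals can be computed directly from the `datetime64` values. This is a sketch, not the notebook's method: `format_time_vectorized` is a hypothetical name, and it relies on `toordinal()` counting days from 0001-01-01, so that the offset from the Unix epoch is a fixed constant.

```python
from datetime import date

import numpy as np

# Ordinal of 1970-01-01, the origin of numpy's datetime64 day counts
UNIX_EPOCH_ORDINAL = date(1970, 1, 1).toordinal()

def format_time_vectorized(time_np):
    # Truncate to whole days, count days since 1970-01-01, shift to ordinals
    days_since_epoch = time_np.astype("datetime64[D]").astype(np.int64)
    return days_since_epoch + UNIX_EPOCH_ORDINAL

# Two made-up timestamps in the same format as geopolar.time.values
sample = np.array(["2002-09-01T12:00:00", "2002-09-02T12:00:00"],
                  dtype="datetime64[ns]")
print(format_time_vectorized(sample))
```

This avoids a Python-level loop over all time steps and does not depend on the exact string format of the timestamps.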

In [43]:
time_ordinal = format_time(geopolar.time.values)

In [62]:
sst_np = geopolar.values

In [74]:
sst_np.shape

(7134, 20, 20)

In [68]:
%%time

# Get number of pixels in each dimension
size_t, size_lat, size_lon = sst_np.shape

# Create empty arrays to hold the outputs
full_climatology = np.empty(sst_np.shape)
full_threshold = np.empty(sst_np.shape)

# loop through each pixel in the sst array
# numpy indexes row, col starting from the upper left
for idx_lat in range(size_lat):
    for idx_lon in range(size_lon):
        # Calculate MHW stats for that pixel
        mhws, point_clim = mhw.detect(time_ordinal, sst_np[:, idx_lat, idx_lon])
        # Add the climatology and threshold to the output arrays
        full_climatology[:, idx_lat, idx_lon] = point_clim['seas']
        full_threshold[:, idx_lat, idx_lon] = point_clim['thresh']

  mhw['rate_decline'].append((mhw_relSeas[tt_peak] - mhw_relSeas[-1]) / (tt_end-tt_start-tt_peak))


CPU times: user 43.9 s, sys: 0 ns, total: 43.9 s
Wall time: 43.9 s
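For scale, the timing above covers a 20 × 20 pixel subset. A rough extrapolation to the full AOI, assuming the per-pixel cost stays constant and the loop remains serial (both are assumptions, not measurements):

```python
wall_time_s = 43.9                  # measured above for the 20 x 20 subset
subset_pixels = 20 * 20

n_time = 7134                       # length of the time axis
aoi_values = 2157321600             # geopolar.size for the full AOI
aoi_pixels = aoi_values // n_time   # spatial pixels in the AOI

per_pixel_s = wall_time_s / subset_pixels
est_hours = per_pixel_s * aoi_pixels / 3600
print(f"{aoi_pixels} pixels, about {est_hours:.1f} hours")
```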


In [83]:
write_test = geopolar.sel(lat=slice(40, 40.1), lon=slice(-70, -69.8))

In [85]:
write_test

| | Array | Chunk |
|---|---|---|
| Bytes | 222.94 kiB | 64 B |
| Shape | (7134, 2, 4) | (2, 2, 4) |
| Count | 14269 Tasks | 3567 Chunks |
| Type | float32 | numpy.ndarray |


In [84]:
write_test.to_dataset().to_zarr('./test.zarr')

<xarray.backends.zarr.ZarrStore at 0x7f35348d0200>