## Accessing Zarr-formatted Daymet data on Azure

The Daymet dataset contains daily minimum temperature, maximum temperature, precipitation, shortwave radiation, vapor pressure, snow water equivalent, and day length at 1 km resolution for North America. The dataset covers the period from January 1, 1980 to December 31, 2019.

The Daymet dataset is maintained at [daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1328](https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1328) and mirrored on Azure Open Datasets at [aka.ms/ai4edata-daymet](https://aka.ms/ai4edata-daymet). Azure also provides a cloud-optimized version of the data in [Zarr](https://zarr.readthedocs.io/en/stable/) format, which can be read into an [xarray](http://xarray.pydata.org/en/stable/) [Dataset](http://xarray.pydata.org/en/stable/data-structures.html#dataset). If you only need a subset of the data, we recommend using xarray and Zarr to avoid downloading the full dataset unnecessarily.

The datasets are available in the `daymeteuwest` storage account, in the `daymet-zarr` container. Files are named according to `daymet-zarr/{frequency}/{region}.zarr`, where `frequency` is one of `{daily, monthly, annual}` and `region` is one of `{hi, na, pr}` (for Hawaii, continental North America, and Puerto Rico, respectively). For example, the daily Hawaii data is at `daymet-zarr/daily/hi.zarr`.
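
The naming scheme above can be sketched as a small helper that builds the store URL and rejects unknown frequency or region codes. The function name `daymet_zarr_url` and the constants are hypothetical, introduced only for illustration; the `az://` prefix is the one used when the store is opened later in this notebook.

```python
# Allowed values, taken from the naming convention described above.
FREQUENCIES = ("daily", "monthly", "annual")
REGIONS = ("hi", "na", "pr")


def daymet_zarr_url(frequency, region, container="daymet-zarr"):
    """Build the az:// URL of a Daymet Zarr store (hypothetical helper)."""
    if frequency not in FREQUENCIES:
        raise ValueError(f"frequency must be one of {FREQUENCIES}, got {frequency!r}")
    if region not in REGIONS:
        raise ValueError(f"region must be one of {REGIONS}, got {region!r}")
    return f"az://{container}/{frequency}/{region}.zarr"


print(daymet_zarr_url("daily", "hi"))  # az://daymet-zarr/daily/hi.zarr
```

Validating the codes up front gives a clear error message instead of a missing-key failure when the store is first read.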

In [None]:
# Standard or standard-ish imports
import warnings
import matplotlib.pyplot as plt

# Less standard, but still pip- or conda-installable
import xarray as xr
import fsspec
from dask.distributed import Client

# Neither of these is accessed directly, but both need to be installed; they're used
# via fsspec
import adlfs
import zarr

In [None]:
account_name = 'daymeteuwest'
container_name = 'daymet-zarr'

### Load data into an xarray Dataset

We can lazily load the data into an `xarray.Dataset` by creating a zarr store with [fsspec](https://filesystem-spec.readthedocs.io/en/latest/) and then reading it in with xarray. This only reads the metadata, so it's safe to call on a dataset that's larger than memory.

In [None]:
store = fsspec.get_mapper('az://' + container_name + '/monthly/na.zarr', account_name=account_name)
# consolidated=True speeds up reading the metadata
ds = xr.open_zarr(store, consolidated=True)
ds

#### Compute the average monthly tmin, tmax and precip for the period 1990-2019

In [None]:
# Label-based slicing on time includes both endpoints
dss = ds.sel(time=slice('1990', '2019'))
dss

In [None]:
# Uncompressed size of the precipitation variable, in GB
dss['prcp'].nbytes / 1e9
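
As a rough cross-check on the number above, the uncompressed size of a gridded variable is the product of its dimension lengths and the bytes per value. The sketch below assumes 4-byte (`float32`) values, and the grid shape in the example is made up for illustration; it is not the actual Daymet grid.

```python
def estimate_size_gb(n_time, n_y, n_x, bytes_per_value=4):
    """Uncompressed size in GB (1 GB = 1e9 bytes), the same unit as nbytes / 1e9."""
    return n_time * n_y * n_x * bytes_per_value / 1e9


# Hypothetical example: 30 years of monthly steps on an 8000 x 8000 grid
size_gb = estimate_size_gb(n_time=30 * 12, n_y=8000, n_x=8000)
print(f"{size_gb:.2f} GB")
```

If the estimate is well beyond the memory of a single machine, that is a sign the computation should be spread across a Dask cluster.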

In [None]:
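import os
import sys
# ebdpy is loaded from a shared directory under $HOME rather than installed as a package
sys.path.append(os.path.join(os.environ['HOME'], 'shared', 'users', 'lib'))
import ebdpy as ebd

# IPython help syntax: show the docstring for start_dask_cluster
?ebd.start_dask_cluster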

In [None]:
profile = 'esip-qhub'
region = 'us-west-2'
endpoint = f's3.{region}.amazonaws.com'
ebd.set_credentials(profile=profile, region=region, endpoint=endpoint)
worker_max = 60
client, cluster = ebd.start_dask_cluster(
    profile=profile, worker_max=worker_max, region=region,
    use_existing_cluster=True, adaptive_scaling=False, wait_for_cluster=False,
    environment='pangeo', worker_profile='Pangeo Worker', propagate_env=True,
)

In [None]:
cluster

#### Store result in Zarr

In [None]:
import fsspec
fs = fsspec.filesystem('s3', anon=False)

In [None]:
store = fs.get_mapper('s3://esip-qhub/usgs/daymet3.zarr')

In [None]:
%%time
# Lazy: this only builds the Dask task graph, so the timing here excludes the actual computation
d_ave = dss[['prcp', 'tmin', 'tmax']].groupby('time.month').mean(dim='time')

In [None]:
d_ave
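
To make the semantics of `groupby('time.month').mean(dim='time')` concrete, here is a pure-Python analogue on a handful of made-up values: every sample is binned by its calendar month, regardless of year, and each bin is averaged. The function name `monthly_climatology` is hypothetical, and unlike xarray's default `mean` this sketch does not skip missing values.

```python
from collections import defaultdict
from datetime import date


def monthly_climatology(samples):
    """Average (date, value) pairs by calendar month across all years."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for day, value in samples:
        totals[day.month] += value
        counts[day.month] += 1
    # One mean per calendar month that appears in the input
    return {month: totals[month] / counts[month] for month in sorted(totals)}


# Made-up values: two Januaries and two Julys from different years
samples = [
    (date(1990, 1, 1), 2.0),
    (date(1991, 1, 1), 4.0),
    (date(1990, 7, 1), 20.0),
    (date(1991, 7, 1), 22.0),
]
climatology = monthly_climatology(samples)
print(climatology)  # {1: 3.0, 7: 21.0}
```

The xarray version does the same thing independently for every grid cell, which is why the result keeps the spatial dimensions and replaces `time` with a `month` dimension.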

In [None]:
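from dask.distributed import performance_report

with performance_report(filename="dask-report.html"):
    # compute=False returns a delayed write; client.compute submits it and returns a future
    f = client.compute(d_ave.to_zarr(store, mode='w', consolidated=True, compute=False), retries=20)
    # Block until the write finishes so the report covers the whole computation
    f.result()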

In [None]:
client.close()
cluster.shutdown()